Initializing Directory and Setting Up Project

Let’s first create a Scrapy project. Make sure that Python and pip are installed on the system, then run the commands given below one by one to create a Scrapy project similar to the one we will be using in this article.

  • Let’s first create a virtual environment in a folder named GFGScrapy and activate it there.
# To create a folder named GFGScrapy
mkdir GFGScrapy
cd GFGScrapy

# making a virtual env there (note the dot: it is created in the current folder)
virtualenv .

# activating it (Windows; on Linux/macOS run `source bin/activate` instead)
cd Scripts
activate
cd ..

Hence, after running all these commands, the virtual environment is created and activated; its name will appear in the shell prompt.

  • Now it’s time to create a Scrapy project. For that, make sure that Scrapy is installed on the system; if it is not, install it using the command given below.

Syntax:

pip install scrapy

Now create a Scrapy project, and a spider inside it, using the commands given below.

scrapy startproject scrapytutorial  # the project name is scrapytutorial

cd scrapytutorial

scrapy genspider spider_to_crawl https://quotes.toscrape.com/

# The URL above is the website that our spider will crawl.

Once you have created the Scrapy project with these commands, the project directory looks like the one described below. (Refer to the Scrapy documentation if you want to know more about a Scrapy project and get familiar with it.)

The directory structure follows the pattern below, where the top-level folder and the inner Python package share the project name:

C://<project-name>/<project-name>

Here the project name is scrapytutorial, and the freshly generated project contains the following files:

scrapytutorial/
    scrapy.cfg            # deploy configuration file
    scrapytutorial/       # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # project middlewares
        pipelines.py      # project pipelines
        settings.py       # project settings
        spiders/          # folder where our spiders live
            __init__.py
            spider_to_crawl.py

The files we are interested in are:

  • spider_to_crawl.py: where we describe the methods for our spider and the URLs it has to crawl.
  • pipelines.py: where we describe the components that will handle the further processing of the scraped data. In simple terms, this file holds the methods used for further operations on the data.
  • settings.py: where we register our components (created in the pipelines.py file) in order.
  • items.py: which describes the dictionary-like structure in which data flows from spider_to_crawl.py to pipelines.py. Here we declare the keys that will be present in each item.
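For illustration, a minimal items.py and the matching settings.py entry might look like the sketch below. The field names Quote and Author are assumptions for the quotes.toscrape.com data, and the pipeline class name follows the default that startproject generates; adjust both to your own project.

# items.py: a minimal sketch; the field names are illustrative
import scrapy

class ScrapytutorialItem(scrapy.Item):
    Quote = scrapy.Field()   # text of a quote (assumed key)
    Author = scrapy.Field()  # name of its author (assumed key)

# settings.py: registering the pipeline component; lower numbers run earlier
ITEM_PIPELINES = {
    'scrapytutorial.pipelines.ScrapytutorialPipeline': 300,
}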

Let’s have a look at our spider_to_crawl.py file, present inside the spiders folder. This is the file where we write the URL that our spider has to crawl and define a method named parse(), which describes what should be done with the data scraped by the spider.

This file is automatically generated by the “scrapy genspider” command used above and is named after the spider. The default generated file is given below.
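A sketch of the default spider file, assuming the standard genspider template (the exact contents vary slightly between Scrapy versions; because we passed a full URL to genspider, some versions prefix it with an extra “http://”):

# spiders/spider_to_crawl.py (default template; may vary by Scrapy version)
import scrapy

class SpiderToCrawlSpider(scrapy.Spider):
    name = 'spider_to_crawl'
    allowed_domains = ['https://quotes.toscrape.com/']
    start_urls = ['http://https://quotes.toscrape.com//']

    def parse(self, response):
        pass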

Note that we made some changes to the above default file, i.e. we commented out the allowed_domains line and fixed the start_urls entry (removed the extra “http://”).
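After those edits, the spider looks roughly like this (the parse() body is still the empty default and will be filled in later):

# spiders/spider_to_crawl.py after the manual edits described above
import scrapy

class SpiderToCrawlSpider(scrapy.Spider):
    name = 'spider_to_crawl'
    # allowed_domains = ['https://quotes.toscrape.com/']
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        pass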

How to Convert Scrapy item to JSON?


Scrapy is a web scraping tool used to collect web data, and it can also modify and store that data in whatever form we want. Whenever data is scraped by a Scrapy spider, we convert the raw data into Scrapy items and then pass those items to pipelines for further processing. In the pipelines, the items are converted to JSON data, which we can either print or save to another file. Hence, we can retrieve JSON data out of web-scraped data.
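To make that flow concrete, a minimal pipelines.py that serializes every item to one line of JSON might look like the sketch below; the class name matches the default one registered in settings.py above, and the output file name result.json is an assumption.

# pipelines.py: a minimal sketch that writes each item as one JSON line
import json

class ScrapytutorialPipeline:
    def open_spider(self, spider):
        # runs once when the spider starts; the file name is illustrative
        self.file = open('result.json', 'w')

    def close_spider(self, spider):
        # runs once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) turns the Scrapy item into a plain dictionary
        self.file.write(json.dumps(dict(item)) + '\n')
        return item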
