Built-in processors

Now, let us look at the built-in processors and the methods we will use in the Item Loaders implementation. Scrapy has six built-in processors:

Identity(): This is the default and simplest processor. It never changes any value and can be used as both an input and an output processor. When no other processor is specified, Identity is applied and the values are returned unchanged.

Python3




# Import the processor
from itemloaders.processors import Identity
 
# Create object of Identity processor
proc = Identity()
 
# Assign values and print result
print(proc(['star','moon','galaxy']))


Output:

['star', 'moon', 'galaxy']
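Identity is also what an ItemLoader falls back to when nothing else is declared. As a minimal sketch (the BookLoader class name here is only illustrative), making that default explicit on a loader subclass looks like this:

Python3

# Hypothetical loader subclass; Identity is already the default,
# so declaring it explicitly only documents the behaviour
from scrapy.loader import ItemLoader
from itemloaders.processors import Identity

class BookLoader(ItemLoader):
    default_input_processor = Identity()
    default_output_processor = Identity()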

TakeFirst(): This returns the first non-null, non-empty value from the data received. It is usually used as an output processor.

Python3




# import the processor module
from itemloaders.processors import TakeFirst
 
# Create object of TakeFirst processor
proc = TakeFirst()
 
# assign values and print the result
print(proc(['', 'star','moon','galaxy']))


Output:

star
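As an illustration of the usual pattern (the loader class and field name below are hypothetical), TakeFirst is typically attached as a field's output processor so that the field stores a single string instead of a one-element list:

Python3

from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst

class BookLoader(ItemLoader):
    # 'title' is stored as a single string, not a list
    title_out = TakeFirst()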

Compose(): This takes the data and passes it to the function given in the argument. If more than one function is given, the result of the previous function is passed to the next one, and so on, until the last function is executed and the final output is returned.

Python3




# Import the processor module
from itemloaders.processors import Compose
 
# Create an object of Compose processor and pass values
proc = Compose(lambda v: v[0], str.upper)
 
# Assign values and print result
print(proc(['hi', 'there']))


Output:

HI

MapCompose(): This processor works similarly to Compose and can also take more than one function as arguments. Here, the input values are iterated over and the first function is applied to each of them, producing a new iterable. This new iterable is then passed to the second function, and so on, until the last function has been applied. If a function returns None for a value, that value is dropped from the result. MapCompose is mainly used as an input processor.

Python3




# Import MapCompose processor
from itemloaders.processors import MapCompose
 
# custom function to filter star
def filter_star(x):
     
    # return None if 'star' is present
    return None if x == 'star' else x
 
# Assign the functions to MapCompose
proc = MapCompose(filter_star, str.upper)
 
# pass arguments and print result
print(proc(['twinkle', 'little', 'star','wonder', 'they']))


Output:

['TWINKLE', 'LITTLE', 'WONDER', 'THEY']
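To make the difference from Compose concrete, here is a small comparison sketch: Compose receives the whole list once, while MapCompose runs each function on every element:

Python3

from itemloaders.processors import Compose, MapCompose

values = ['hi', 'there']

# Compose: the whole list goes through the chain once
print(Compose(lambda v: v[0], str.upper)(values))   # HI

# MapCompose: every element goes through the chain
print(MapCompose(str.upper)(values))                 # ['HI', 'THERE']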

Join(): This processor returns the values joined together into a single string. To put a string between the items, one can pass a separator; the default is a single space (u' '). In the example below, we have used '<a>' as the separator:

Python3




# Import required processors
from itemloaders.processors import Join
 
# Pass the separator '<a>' while creating the Join() object
proc = Join('<a>')
 
# pass the values and print result
print(proc(['Sky', 'Moon','Stars']))


Output:

Sky<a>Moon<a>Stars
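For comparison, calling Join() with no argument falls back to the default single-space separator:

Python3

from itemloaders.processors import Join

# No separator given, so the default single space is used
print(Join()(['Sky', 'Moon', 'Stars']))   # Sky Moon Stars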

SelectJmes(): This processor queries the value using the JMESPath expression given as its argument and returns the output. It requires the jmespath library to be installed.

Python3




# Import the class
from itemloaders.processors import SelectJmes
 
# prepare object of SelectJmes
proc = SelectJmes("hello")
 
# Print the output of the JMESPath query
print(proc({'hello': 'scrapy'}))


Output:

scrapy
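Because the argument is a JMESPath expression, nested lookups also work; the dictionary keys below are made up for illustration:

Python3

from itemloaders.processors import SelectJmes

# Query a nested key with a dotted JMESPath expression
proc = SelectJmes('book.title')
print(proc({'book': {'title': 'scrapy', 'price': 20}}))   # scrapy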

In the implementation in this article, we use the TakeFirst() and MapCompose() processors. The processors act on the scraped data when the Item Loader methods, such as add_xpath() and others, are executed. The most commonly used loader methods are –

  • add_xpath() – This method takes the item field name and the corresponding XPath expression for it. It mainly accepts the following parameters:
    • field_name – the item field name defined in the items.py class.
    • xpath – the XPath expression used to navigate to the tag.
    • processors – input processor name. If no processor is defined, the default one is used.
  • add_css() – This method takes the item field name and the corresponding CSS expression for it. It mainly accepts the following parameters:
    • field_name – the item field name defined in the items.py class.
    • css – the CSS expression used to navigate to the tag.
    • processors – input processor name. If no processor is defined, the default one is used.
  • add_value() – This method takes the item field name and a raw value for it. It accepts the following parameters:
    • field_name – the item field name defined in the items.py class.
    • value – the value to assign to the field.
    • processors – input processor name. If no processor is defined, the default one is used.

One can make use of any of the above loader methods. In this article, we use XPath expressions to scrape the data, hence the add_xpath() method of the loader is used. All the processors mentioned above are defined in the processors.py module of the itemloaders package (imported as itemloaders.processors), as shown in the sketch below.
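As a hedged, end-to-end sketch of how the processors and loader methods fit together (the item fields, XPath expressions, and start URL below are illustrative assumptions, not the exact spider built in this article):

Python3

import scrapy
from scrapy.loader import ItemLoader
from itemloaders.processors import MapCompose, TakeFirst

class BookItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()

class BookLoader(ItemLoader):
    # input: strip whitespace from every extracted string
    default_input_processor = MapCompose(str.strip)
    # output: keep only the first value for every field
    default_output_processor = TakeFirst()

class BookSpider(scrapy.Spider):
    name = 'books'
    # assumed practice site; replace with the page you are scraping
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        for book in response.xpath('//article[@class="product_pod"]'):
            loader = BookLoader(item=BookItem(), selector=book)
            # add_xpath(field_name, xpath, *processors)
            loader.add_xpath('title', './/h3/a/@title')
            loader.add_xpath('price', './/p[@class="price_color"]/text()')
            yield loader.load_item()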
