Built-in processors
Now, let us understand the built-in processors and methods that we will use in the Item Loaders implementation. Scrapy has six built-in processors. Let us look at each of them:
Identity(): This is the default and simplest processor. It never changes any value, and it can be used as an input as well as an output processor. This means that when no other processor is mentioned, this one acts and returns the values unchanged.
Python3
# Import the processor
from itemloaders.processors import Identity

# Create an object of the Identity processor
proc = Identity()

# Assign values and print the result
print(proc(['star', 'moon', 'galaxy']))
Output:
['star', 'moon', 'galaxy']
TakeFirst(): This returns the first non-null, non-empty value from the data received. It is usually used as an output processor.
Python3
# Import the processor module
from itemloaders.processors import TakeFirst

# Create an object of the TakeFirst processor
proc = TakeFirst()

# Assign values (the first one is empty) and print the result
print(proc(['', 'star', 'moon', 'galaxy']))
Output:
star
Compose(): This takes the data and passes it to the function present in the argument. If more than one function is present in the argument, the result of the previous function is passed to the next. This continues until the last function is executed and the output is returned.
Python3
# Import the processor module
from itemloaders.processors import Compose

# Create an object of the Compose processor, chaining two functions
proc = Compose(lambda v: v[0], str.upper)

# Assign values and print the result
print(proc(['hi', 'there']))
Output:
HI
MapCompose(): This processor works similarly to Compose and can also take more than one function as an argument. Here, however, the input values are iterated, and the first function is applied to each of them, resulting in a new iterable. This new iterable is then passed to the second function in the argument, and so on. Values for which a function returns None are dropped from the result. It is mainly used as an input processor.
Python3
# Import the MapCompose processor
from itemloaders.processors import MapCompose

# Custom function to filter out 'star'
def filter_star(x):
    # Return None if the value is 'star', so it is dropped
    return None if x == 'star' else x

# Assign the functions to MapCompose
proc = MapCompose(filter_star, str.upper)

# Pass arguments and print the result
print(proc(['twinkle', 'little', 'star', 'wonder', 'they']))
Output:
['TWINKLE', 'LITTLE', 'WONDER', 'THEY']
Join(): This processor returns the values joined together. To put an expression between each item, one can pass a separator; the default separator is a single space (u' '). In the example below, we have used '<a>' as the separator:
Python3
# Import the required processor
from itemloaders.processors import Join

# Pass the separator '<a>' while creating the Join() object
proc = Join('<a>')

# Pass the values and print the result
print(proc(['Sky', 'Moon', 'Stars']))
Output:
Sky<a>Moon<a>Stars
SelectJmes(): This processor queries the value using the JMESPath expression given and returns the output. It requires the jmespath library to be installed.
Python3
# Import the processor
from itemloaders.processors import SelectJmes

# Prepare an object of SelectJmes with the query 'hello'
proc = SelectJmes("hello")

# Print the output of the JMESPath query
print(proc({'hello': 'scrapy'}))
Output:
scrapy
In this example, we have used the TakeFirst() and MapCompose() processors. The processors act on the scraped data when Item Loader methods, such as add_xpath() and others, are executed. The most commonly used loader methods are listed below (a combined sketch follows the list):
- add_xpath() – This method takes the item field and the corresponding XPath expression for it. It mainly accepts the following parameters:
  - field_name – The item field name, defined in the 'items.py' class.
  - xpath – The XPath expression used to navigate to the tag.
  - processors – The input processor name. If no processor is defined, the default one is called.
- add_css() – This method takes the item field and the corresponding CSS expression for it. It mainly accepts the following parameters:
  - field_name – The item field name, defined in the 'items.py' class.
  - css – The CSS expression used to navigate to the tag.
  - processors – The input processor name. If no processor is defined, the default one is called.
- add_value() – This method takes a field name and a literal value for it. It accepts the following parameters:
  - field_name – The item field name, defined in the 'items.py' class.
  - value – The literal value to assign to the field.
  - processors – The input processor name. If no processor is defined, the default one is called.
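As an illustration, here is a minimal sketch of how these methods might be called inside a spider's parse() callback. The ArticleItem class, the field names, and the XPath/CSS expressions are hypothetical stand-ins, not taken from this article:
Python3
# A minimal sketch of the loader methods above; ArticleItem, the field
# names, and the selector expressions are hypothetical examples.
from scrapy.loader import ItemLoader
from myproject.items import ArticleItem  # hypothetical item class

def parse(self, response):
    loader = ItemLoader(item=ArticleItem(), response=response)

    # add_xpath(field_name, xpath): extract 'title' with an XPath expression
    loader.add_xpath('title', '//h1/text()')

    # add_css(field_name, css): extract 'author' with a CSS expression
    loader.add_css('author', 'span.author::text')

    # add_value(field_name, value): assign a literal value to a field
    loader.add_value('source', response.url)

    yield loader.load_item()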
One can make use of any of the above loader methods. In this article, we have used XPath expressions to scrape data, hence the add_xpath() method of the loader is used. All of the processors mentioned above are defined in the processors.py file of the itemloaders package that ships with Scrapy, from which we can import them.
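To tie the processors and loader methods together, below is a minimal sketch of how TakeFirst() and MapCompose() could be attached to a loader so that they run automatically whenever add_xpath() is executed. The ArticleLoader name and the 'title' field are hypothetical assumptions:
Python3
# A minimal sketch, assuming a hypothetical loader with a 'title' field;
# it shows where input and output processors are declared.
from itemloaders.processors import MapCompose, TakeFirst
from scrapy.loader import ItemLoader

class ArticleLoader(ItemLoader):
    # The output processor applied to every field of this loader
    default_output_processor = TakeFirst()

    # <field>_in declares the input processor for that field:
    # here, strip whitespace from each extracted 'title' value
    title_in = MapCompose(str.strip)
With this declaration, every value added through add_xpath('title', ...) is stripped on input, and load_item() returns only the first non-empty result for the field.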