How to avoid getting your IP address banned?
Controlling the crawl rate is the most important thing to keep in mind when carrying out a very large extraction. Bombarding the server with many requests in a short span of time will most likely get your IP address blacklisted. To avoid this, we can crawl in short, random bursts: we add pauses, or little breaks, between requests. Websites can easily identify a crawler by its speed compared to a human visitor, so these pauses help the program look more like an actual person browsing the site. They also avoid unnecessary traffic and keep the website's servers from being overloaded. Win-win!
Now, how do we control the crawling rate? It's simple: with two functions, randint() and sleep(), from the Python modules 'random' and 'time' respectively.
Python3
from random import randint
from time import sleep

print(randint(1, 10))
1
The randint() function chooses a random integer between the given lower and upper limits, in this case 1 and 10 respectively, on every call. Using randint() in combination with sleep() will help add short, random breaks to the crawl rate of the program. The sleep() function pauses the execution of the program for the given number of seconds; here, that number of seconds is supplied on each iteration by randint(). Use the code given below for reference.
Python3
from time import sleep
from random import randint

for i in range(0, 3):
    # select a random integer in the given range
    x = randint(2, 5)
    print(x)
    sleep(x)
    print(f'I waited {x} seconds')
5
I waited 5 seconds
4
I waited 4 seconds
5
I waited 5 seconds
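A fixed random pause works for steady crawling, but when a request fails or the server starts answering slowly, many crawlers also lengthen the pause on each retry. This is not part of the example above; as a minimal sketch of that pattern (exponential backoff with jitter, with illustrative names), it could look like:

```python
from random import uniform

def backoff_delay(attempt, base=1.0, cap=30.0):
    """Seconds to wait before retry number `attempt` (0-based).

    Doubles the base delay on every attempt, caps it, and then picks a
    random point in that window so many clients don't retry in sync.
    """
    return uniform(0, min(cap, base * 2 ** attempt))
```

Passing the result to sleep() before each retry spreads repeated requests further and further apart, which is gentler on a struggling server than hammering it at a constant rate.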
To get a clearer idea of these functions in action, refer to the code given below.
Python3
import requests
from bs4 import BeautifulSoup as bs
from random import randint
from time import sleep

# Base address of the paginated listing; set this to the site you are
# scraping (the page number is appended below).
URL = 'https://www.example.com/page/'

# Please note that the total number of pages on the website is more
# than 5000, so we take only the first few here as this is just an
# example (range(1, 10) covers pages 1 to 9).
for page in range(1, 10):
    req = requests.get(URL + str(page) + '/')
    soup = bs(req.text, 'html.parser')
    titles = soup.find_all('div', attrs={'class': 'head'})

    # positions 4 to 18 of the matched divs hold the 15 article titles
    for i in range(4, 19):
        if page > 1:
            # continue the numbering from the previous pages
            print(f'{(i - 3) + (page - 1) * 15} ' + titles[i].text)
        else:
            print(f'{i - 3} ' + titles[i].text)
    sleep(randint(2, 10))
Output:
How to Scrape Multiple Pages of a Website Using Python?
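The index arithmetic in the print calls of the scraping loop above is the easiest place to slip up. The numbering it is meant to produce (titles 1–15 on the first page, 16–30 on the second, and so on, with the titles sitting at positions 4–18 of the matched divs) can be isolated into a small helper for a sanity check; `article_number` is an illustrative name, not part of the original script:

```python
def article_number(page, i, per_page=15, first_index=4):
    """Global article number for the title at position `i` on `page`.

    Pages are 1-based; the matched divs carrying titles start at
    position `first_index` (4 in the script above), `per_page` per page.
    """
    return (i - first_index + 1) + (page - 1) * per_page

print(article_number(1, 4))   # → 1  (first title on page 1)
print(article_number(2, 4))   # → 16 (first title on page 2)
```

Checking a few page/position pairs like this before running a long crawl is much cheaper than discovering off-by-one numbering in thousands of printed lines.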
Web scraping is a method of extracting useful data from a website using computer programs, without having to do it manually. This data can then be exported and categorically organized for various purposes. Common places where web scraping finds use are market research and analysis, price-comparison tools, search engines, data collection for AI/ML projects, etc.
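At its core, the "extract" step is just HTML parsing. As a dependency-free illustration (the HTML snippet and the `head` class name here are made up for the example), Python's standard-library html.parser can pull titles out of markup without any third-party packages:

```python
from html.parser import HTMLParser

class TitleCollector(HTMLParser):
    """Collects the text of every <div class="head"> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == 'div' and dict(attrs).get('class') == 'head':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'div':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title and data.strip():
            self.titles.append(data.strip())

html = ('<div class="head">First article</div>'
        '<p>intro</p>'
        '<div class="head">Second article</div>')
parser = TitleCollector()
parser.feed(html)
print(parser.titles)  # → ['First article', 'Second article']
```

Libraries like BeautifulSoup, used later in this article, do the same job with far less boilerplate and much more tolerance for messy real-world HTML.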
Let’s dive deep and scrape a website. In this article, we are going to take the w3wiki website and extract the titles of all the articles available on the Homepage using a Python script.
If you notice, there are thousands of articles on the website, and to extract all of them we will have to scrape through all the pages so that we don't miss out on any!
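One way to make sure no page is missed is to keep requesting successive pages until one comes back empty, rather than hard-coding a page count. The sketch below uses a stand-in `fetch_titles` function (an assumption for illustration) in place of the real request-and-parse step shown later in the article:

```python
def scrape_all(fetch_titles, max_pages=10000):
    """Walk pages 1, 2, 3, ... until a page yields no titles.

    `fetch_titles(page)` stands in for the request + parse step;
    `max_pages` is a safety cap in case pagination never ends.
    """
    collected = []
    for page in range(1, max_pages + 1):
        titles = fetch_titles(page)
        if not titles:
            break
        collected.extend(titles)
    return collected

# Simulated site with 3 pages of 2 titles each:
fake = lambda page: [f'p{page}-a', f'p{page}-b'] if page <= 3 else []
print(scrape_all(fake))  # → ['p1-a', 'p1-b', 'p2-a', 'p2-b', 'p3-a', 'p3-b']
```

In a real crawl, `fetch_titles` would issue the HTTP request, parse the response, and (as discussed above) sleep a random number of seconds before returning.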