How to avoid getting your IP address banned?

Controlling the crawl rate is the most important thing to keep in mind when carrying out a very large extraction. Bombarding the server with many requests in a short span of time will most likely get your IP address blacklisted. To avoid this, we can simply carry out our crawling in short, random bursts. In other words, we add pauses, or little breaks, between crawling periods. Websites can easily identify a crawler by the sheer speed at which it fires requests compared to a human visitor, so these pauses help the program look more like an actual human. They also avoid unnecessary traffic and keep the website's servers from being overloaded. Win-win!

Now, how do we control the crawling rate? It's simple: by using two functions, randint() and sleep(), from the Python modules random and time respectively.

Python3

from random import randint
from time import sleep

# print a random integer between 1 and 10 (inclusive)
print(randint(1, 10))


Output

1

The randint() function returns a random integer between the given lower and upper limits, in this case 1 and 10 respectively (both inclusive). Using randint() in combination with sleep() lets us add short, random breaks to the crawling rate of the program. The sleep() function pauses the execution of the program for the given number of seconds. Here, that number of seconds is chosen randomly by randint() on every iteration of the loop. Use the code given below for reference.

Python3

from time import sleep
from random import randint

for i in range(3):
    # pick a random wait of 2 to 5 seconds (inclusive)
    x = randint(2, 5)
    print(x)
    # pause the program for x seconds
    sleep(x)
    print(f'I waited {x} seconds')


Output

5
I waited 5 seconds
4
I waited 4 seconds
5
I waited 5 seconds

To get a clearer idea of these functions in action inside an actual scraping script, refer to the code given below.

Python3

import requests
from bs4 import BeautifulSoup as bs
from random import randint
from time import sleep


# URL should hold the base address of the paginated article listing
# (set it before running this snippet); please note that the website
# has more than 5000 pages in total, so we only take the first few
# here as this is just an example
for page in range(1, 10):

    req = requests.get(URL + str(page) + '/')
    soup = bs(req.text, 'html.parser')

    # each article title sits in a <div> with the class 'head'
    titles = soup.find_all('div', attrs={'class': 'head'})

    # entries 4 to 18 hold the 15 article titles of a page,
    # so continue the numbering across pages (15 titles per page)
    for i in range(4, 19):
        print(f"{(i - 3) + (page - 1) * 15} " + titles[i].text)

    # pause for a random interval of 2 to 10 seconds before
    # requesting the next page
    sleep(randint(2, 10))


Output:

[Screenshot: the program pauses its execution between page requests and then resumes]

[Screenshot: the output of the above code]
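
Putting the two pieces together, it can help to wrap requests.get() and the random pause in one small helper, so that every request in a crawl is throttled the same way. The sketch below is only an illustration: the helper name polite_get(), the 2 to 10 second delay range and the placeholder example.com URL are assumptions, not part of the original article.

Python3

from random import randint
from time import sleep

import requests


def polite_get(url, min_delay=2, max_delay=10):
    """Fetch a URL and then pause for a random number of seconds.

    Hypothetical helper for illustration; adapt the name, delay range
    and URLs to the site you are actually scraping.
    """
    response = requests.get(url)
    # wait between min_delay and max_delay seconds so the next
    # request does not follow immediately
    sleep(randint(min_delay, max_delay))
    return response


# example usage with a hypothetical placeholder listing URL
for page in range(1, 4):
    resp = polite_get('https://www.example.com/page/' + str(page) + '/')
    print(page, resp.status_code)

Because the pause lives inside the helper, the scraping loop itself stays short, and every page request automatically waits before the next one is sent.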


