How to avoid getting your IP address banned?
Controlling the crawl rate is the most important thing to keep in mind when carrying out a very large extraction. Bombarding the server with many requests in a short span of time will most likely get your IP address blacklisted. To avoid this, we can crawl in short, random bursts: we add pauses, or little breaks, between requests. Websites can easily identify a crawler by its speed compared to a human visitor, so these pauses help the program look more like an actual person browsing the site. They also avoid unnecessary traffic and keep the website's servers from being overloaded. Win-win!
Now, how do we control the crawling rate? It's simple: with two functions, randint() and sleep(), from the Python modules 'random' and 'time' respectively.
Python3
from random import randint
from time import sleep

print(randint(1, 10))
1
The randint() function chooses a random integer between the given lower and upper limits, in this case 1 and 10 respectively, on every call. Using randint() in combination with sleep() will help add short, random breaks to the crawl rate of the program. The sleep() function pauses the execution of the program for the given number of seconds; here, that number of seconds is supplied on each iteration by randint(). Use the code given below for reference.
Python3
from time import sleep
from random import randint

for i in range(0, 3):
    # select a random integer in the given range
    x = randint(2, 5)
    print(x)
    sleep(x)
    print(f'I waited {x} seconds')
5
I waited 5 seconds
4
I waited 4 seconds
5
I waited 5 seconds
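A fixed random pause works for steady crawling, but when a request fails or the server starts answering slowly, many crawlers also lengthen the pause on each retry. This is not part of the example above; as a minimal sketch of that pattern (exponential backoff with jitter, with illustrative names), it could look like:

```python
from random import uniform

def backoff_delay(attempt, base=1.0, cap=30.0):
    """Seconds to wait before retry number `attempt` (0-based).

    Doubles the base delay on every attempt, caps it, and then picks a
    random point in that window so many clients don't retry in sync.
    """
    return uniform(0, min(cap, base * 2 ** attempt))
```

Passing the result to sleep() before each retry spreads repeated requests further and further apart, which is gentler on a struggling server than hammering it at a constant rate.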
To get a clearer idea of these functions in action, refer to the code given below.
Python3
import requests
from bs4 import BeautifulSoup as bs
from random import randint
from time import sleep

# Base address of the paginated listing; set this to the site you are
# scraping (the page number is appended below).
URL = 'https://www.example.com/page/'

# Please note that the total number of pages on the website is more
# than 5000, so we take only the first few here as this is just an
# example (range(1, 10) covers pages 1 to 9).
for page in range(1, 10):
    req = requests.get(URL + str(page) + '/')
    soup = bs(req.text, 'html.parser')
    titles = soup.find_all('div', attrs={'class': 'head'})

    # positions 4 to 18 of the matched divs hold the 15 article titles
    for i in range(4, 19):
        if page > 1:
            # continue the numbering from the previous pages
            print(f'{(i - 3) + (page - 1) * 15} ' + titles[i].text)
        else:
            print(f'{i - 3} ' + titles[i].text)
    sleep(randint(2, 10))
Output:
How to Scrape Multiple Pages of a Website Using Python?
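The index arithmetic in the print calls of the scraping loop above is the easiest place to slip up. The numbering it is meant to produce (titles 1–15 on the first page, 16–30 on the second, and so on, with the titles sitting at positions 4–18 of the matched divs) can be isolated into a small helper for a sanity check; `article_number` is an illustrative name, not part of the original script:

```python
def article_number(page, i, per_page=15, first_index=4):
    """Global article number for the title at position `i` on `page`.

    Pages are 1-based; the matched divs carrying titles start at
    position `first_index` (4 in the script above), `per_page` per page.
    """
    return (i - first_index + 1) + (page - 1) * per_page

print(article_number(1, 4))   # → 1  (first title on page 1)
print(article_number(2, 4))   # → 16 (first title on page 2)
```

Checking a few page/position pairs like this before running a long crawl is much cheaper than discovering off-by-one numbering in thousands of printed lines.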
Web scraping is a method of extracting useful data from a website using computer programs, without having to do it manually. This data can then be exported and categorically organized for various purposes. Common places where web scraping finds use are market research and analysis, price-comparison tools, search engines, data collection for AI/ML projects, etc.
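At its core, the "extract" step is just HTML parsing. As a dependency-free illustration (the HTML snippet and the `head` class name here are made up for the example), Python's standard-library html.parser can pull titles out of markup without any third-party packages:

```python
from html.parser import HTMLParser

class TitleCollector(HTMLParser):
    """Collects the text of every <div class="head"> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == 'div' and dict(attrs).get('class') == 'head':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'div':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title and data.strip():
            self.titles.append(data.strip())

html = ('<div class="head">First article</div>'
        '<p>intro</p>'
        '<div class="head">Second article</div>')
parser = TitleCollector()
parser.feed(html)
print(parser.titles)  # → ['First article', 'Second article']
```

Libraries like BeautifulSoup, used later in this article, do the same job with far less boilerplate and much more tolerance for messy real-world HTML.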
Let’s dive deep and scrape a website. In this article, we are going to take the w3wiki website and extract the titles of all the articles available on the Homepage using a Python script.
If you notice, there are thousands of articles on the website, and to extract all of them we will have to scrape through all the pages so that we don't miss out on any!
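One way to make sure no page is missed is to keep requesting successive pages until one comes back empty, rather than hard-coding a page count. The sketch below uses a stand-in `fetch_titles` function (an assumption for illustration) in place of the real request-and-parse step shown later in the article:

```python
def scrape_all(fetch_titles, max_pages=10000):
    """Walk pages 1, 2, 3, ... until a page yields no titles.

    `fetch_titles(page)` stands in for the request + parse step;
    `max_pages` is a safety cap in case pagination never ends.
    """
    collected = []
    for page in range(1, max_pages + 1):
        titles = fetch_titles(page)
        if not titles:
            break
        collected.extend(titles)
    return collected

# Simulated site with 3 pages of 2 titles each:
fake = lambda page: [f'p{page}-a', f'p{page}-b'] if page <= 3 else []
print(scrape_all(fake))  # → ['p1-a', 'p1-b', 'p2-a', 'p2-b', 'p3-a', 'p3-b']
```

In a real crawl, `fetch_titles` would issue the HTTP request, parse the response, and (as discussed above) sleep a random number of seconds before returning.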