Data scraping should be done responsibly. When you scrape any website, you must be cautious, because careless scraping can have negative effects on the site. Several free web scrapers on the market can scrape websites smoothly without getting blocked, and many sites do not deploy anti-scraping mechanisms. However, some websites block scrapers because they do not believe in open data access. Remember that you should always follow a site's scraping policies. Are you building web scrapers for your project? Then you should follow the essential Tips To Avoid Getting Blocked While Scraping Websites covered in this article before you start scraping any website.
What is Web Scraping?
Web Scraping is a method through which you can scrape or extract data from a website using a web browser or the HTTP protocol. The process can be automated with a web crawler or a bot, or it can be done manually. A common misconception is that web scraping is illegal; in truth, it is legal unless you try to access non-public data that is not reachable by the general public, such as content behind login credentials.
If you scrape small sites, you may not face any problems. But if you try web scraping on a large website, or on Google, you may find that your requests get ignored or that your IP gets blocked.
How Do Websites Detect Web Crawlers?
Websites check a few crucial signals, including IP addresses, user agents, browser parameters, and general behavior, to identify web crawlers and web scraping tools. When a site finds your activity suspicious, you will start receiving CAPTCHAs, and once your crawler is confirmed as a bot, your requests will be blocked.
Websites can implement multiple mechanisms to detect scrapers, including:
- Monitoring traffic patterns, such as a lot of product views without purchases.
- Monitoring competitors.
- Monitoring users with high activity levels.
- Setting up honeypots.
- Using behavioral analysis.
- Checking the user's browser and PC parameters.
How To Make Your Scraper Look Human?
To make your scraper look human, try to mimic human browsing behavior. Here are a few ways in which humans typically use the web:
- They don’t break the rules.
- They don’t surf too quickly.
- They don’t interact with the web page’s invisible patterns.
- They use browsers.
- They don’t have only a single pattern of browsing.
- They can solve captchas.
- They save cookies and use them.
Now, let’s learn how it is possible to crawl a website without getting blocked.
Tips To Avoid Getting Blocked While Scraping Websites:
- Check Robots Exclusion Protocol:
Before you scrape or crawl any website, make sure that your target allows you to collect data from its pages. Inspect the robots.txt file (the robots exclusion protocol) and respect the website's rules.
Be respectful even when the web page allows crawling, and do not harm the page. Follow the rules outlined in the robots exclusion protocol: crawl during off-peak hours, limit requests coming from a single IP address, and set a delay between them. Even if a website lets you perform web scraping, you can still get blocked, so let's check the other tips to avoid getting blocked while scraping websites.
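As a minimal sketch of this check, the snippet below uses Python's built-in urllib.robotparser to ask whether a page may be fetched and whether the site declares a crawl delay. The bot name and page URL are placeholders for illustration.
from urllib.robotparser import RobotFileParser

# Example target; swap in the site you actually plan to scrape.
robots_url = "https://www.prodigitalweb.com/robots.txt"

parser = RobotFileParser()
parser.set_url(robots_url)
parser.read()  # downloads and parses robots.txt

user_agent = "MyScraperBot"  # hypothetical bot name
page = "https://www.prodigitalweb.com/some-article/"  # hypothetical page

if parser.can_fetch(user_agent, page):
    print("Allowed to fetch:", page)
else:
    print("robots.txt disallows fetching:", page)

# Respect any declared crawl delay between requests.
delay = parser.crawl_delay(user_agent)
print("Crawl delay (seconds):", delay if delay is not None else "not specified")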
- Use A Proxy Server:
Web crawling is almost impossible without proxies. Choose a reliable proxy service provider, and then select between datacenter and residential IP proxies based on your task.
Using an intermediary between your device and the target website reduces IP address blocks and ensures anonymity. It also lets you access websites that may not be available in your region. For instance, people based in Germany can use a US proxy to access web content in the US.
Choose a proxy provider with a large pool of IPs. Below is a small Python code snippet you can use to collect multiple proxy IP addresses before making requests.
from bs4 import BeautifulSoup
import requests

l = {}
u = list()
country_code = "us"  # example country code (assumed for illustration)

# Proxy list page; the domain here appears obfuscated in the original,
# so substitute a real proxy-list URL before running.
url = "https://www.proxydotcom/proxy-server-list/country-" + country_code + "/"
respo = requests.get(url).text
soup = BeautifulSoup(respo, "html.parser")
allproxy = soup.find_all("tr")

for proxy in allproxy:
    foo = proxy.find_all("td")
    try:
        # Clean up the IP cell (the site wraps it in a document.write() call)
        l["ip"] = (foo[0].text.replace("\n", "").replace("document.write(", "")
                   .replace(")", "").replace("'", "").replace(";", ""))
    except:
        l["ip"] = None
    try:
        l["port"] = foo[1].text.replace("\n", "").replace(" ", "")
    except:
        l["port"] = None
    try:
        l["country"] = foo[5].text.replace("\n", "").replace(" ", "")
    except:
        l["country"] = None
    if l["port"] is not None:
        u.append(l)
        l = {}

print(u)
- IP Rotation:
Websites examine the IP addresses of web scrapers to detect them and track their behavior. Once the server finds strange behavior, a suspicious pattern, or a request frequency that would be impossible for a real user, it blocks that IP address from accessing the website.
Using an IP rotation service is essential to avoid sending all requests through the same IP address. With IP rotation, you route your requests through a proxy pool to hide your actual IP address while scraping the site, which lets you scrape most sites without any problem.
But remember that not all proxies are made equal. Several sites use more advanced anti-scraping mechanisms that detect ordinary proxies, and in those cases you should use residential proxies. With IP rotation, your scraper can make requests appear to come from different users and mimic the normal behavior of online traffic.
Let's have a look at an example of the proxies used here:
IP: 180.179.98.22
Port: 3128
Program:
import requests
# used to parse HTML text
from lxml.html import fromstring
from itertools import cycle
import traceback


def to_get_proxies():
    # website to get free proxies
    url = 'https://free-proxy-list.net/'
    response = requests.get(url)
    parser = fromstring(response.text)
    # using a set to avoid duplicate IP entries
    proxies = set()
    for i in parser.xpath('//tbody/tr')[:10]:
        # check if the corresponding proxy supports HTTPS
        if i.xpath('.//td[7][contains(text(),"yes")]'):
            # grab the IP and the corresponding port
            proxy = ":".join([i.xpath('.//td[1]/text()')[0],
                              i.xpath('.//td[2]/text()')[0]])
            proxies.add(proxy)
    return proxies
Output:
proxies = {'160.16.77.108:3128', '20.195.17.90:3128', '14.225.5.68:80', '158.46.127.222:52574', '159.192.130.233:8080', '124.106.224.5:8080', '51.79.157.202:443', '161.202.226.194:80'}
Now, let’s rotate the IP using the round-robin process.
Program:
proxies = to_get_proxies()

# to rotate through the list of IPs
proxyPool = cycle(proxies)

# insert the URL of the website you want to scrape
url = ''

for i in range(1, 11):
    # get a proxy from the pool
    proxy = next(proxyPool)
    print("Request #%d" % i)
    try:
        response = requests.get(url, proxies={"http": proxy, "https": proxy})
        print(response.json())
    except:
        # most free proxies will cause connection errors,
        # so we simply skip to the next one instead of retrying
        print("Skipping. Connection error")
Output:
Request #1
Skipping. Connection error
Request #2
Skipping. Connection error
Request #3
Skipping. Connection error
Request #4
Skipping. Connection error
Request #5
Skipping. Connection error
Request #6
Skipping. Connection error
Request #7
Skipping. Connection error
Request #8
Skipping. Connection error
Request #9
Skipping. Connection error
Request #10
Skipping. Connection error
- Set A Real User Agent:
The User-Agent is a special HTTP header that tells the visited website exactly which browser you are using. Some websites examine User-Agents and block requests whose User-Agent does not belong to a major browser. Since most web scrapers do not bother setting the User-Agent, they are easy to detect.
Set a popular User-Agent for your web crawler. If you are an advanced user, you can set your User-Agent to the Googlebot User-Agent, because most websites want to be listed on Google and therefore allow Googlebot to access their content.
Remember to keep the User-Agents you use up to date. New releases of browsers such as Safari, Chrome, and Firefox carry different User-Agent strings, so if you don't change the User-Agent on your crawlers for years, they will become more and more suspicious. You can also rotate between several User-Agents to avoid a sudden spike in requests from one exact User-Agent to a site, as in the sketch below.
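Here is a minimal sketch of User-Agent rotation with the requests library; the User-Agent strings in the list are illustrative examples and should be refreshed with current browser versions.
import random
import requests

# A small pool of example User-Agent strings; keep these up to date.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def fetch(url):
    # Pick a different User-Agent for each request.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers)

# Example usage with a placeholder URL:
# response = fetch("https://www.example.com/")
# print(response.status_code)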
- Set Your Fingerprint Right:
Anti-scraping mechanisms have become increasingly sophisticated, and some websites use TCP or IP fingerprinting to detect bots. When you scrape the web, your TCP stack exposes various parameters that are set by the end user's OS or device. If you are wondering how to avoid being blacklisted while scraping, make sure those parameters stay consistent. Alternatively, you can use Web Unblocker, an AI-powered proxy solution with dynamic fingerprinting functionality. It combines many fingerprinting variables in such a way that, even though it settles on a single best-working fingerprint, the fingerprints still appear random and can successfully pass anti-bot checks.
- Beware Of Honeypot Traps:
Many sites place invisible links (which only a robot would follow) to identify web crawlers. To avoid them, check whether a link has the "display: none" or "visibility: hidden" CSS property set; if it does, do not follow it. Otherwise, the server may spot your scraper, fingerprint the properties of your requests, and block you easily. Smart webmasters frequently use honeypots to detect crawlers, so make sure this check is performed on every page you scrape.
Advanced webmasters also sometimes set the link color to white, or to whatever color the page's background has. So you might also need to check whether the link has "color: #fff;" or "color: #ffffff;" set, as that can make the link invisible. A hedged example of this check follows below.
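Below is a minimal sketch of that filter using BeautifulSoup. It only inspects inline style attributes, so treat it as an illustration rather than a complete solution; links hidden via external CSS would require a rendering engine to detect.
from bs4 import BeautifulSoup

HIDDEN_MARKERS = ("display: none", "display:none",
                  "visibility: hidden", "visibility:hidden",
                  "color: #fff", "color:#fff",
                  "color: #ffffff", "color:#ffffff")

def visible_links(html):
    """Return href values of links that are not obviously hidden."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        style = (a.get("style") or "").lower()
        if any(marker in style for marker in HIDDEN_MARKERS):
            continue  # likely a honeypot link; skip it
        links.append(a["href"])
    return links

# Example usage:
# html = requests.get("https://www.example.com/").text
# for href in visible_links(html):
#     print(href)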
- Use CAPTCHA-solving Services:
Displaying a CAPTCHA is a very common way for sites to crack down on crawlers. Simple data-collecting scripts usually have no way to solve CAPTCHAs, so they can block most scrapers outright. Fortunately, some services are designed to get past these restrictions in an economical way; they range from fully integrated scraping solutions to narrow CAPTCHA-solving services that you integrate just for the CAPTCHA-solving functionality.
Keep in mind that some CAPTCHA-solving services are slow and costly, so you should consider whether scraping sites that require continuous CAPTCHA-solving remains worthwhile over time.
- Change The Crawling Pattern:
The crawling pattern is the way your crawler is configured to navigate the website. If you constantly use the same basic crawling pattern, you can be blocked at any time. To make crawling less predictable, include random clicks, scrolls, and mouse movements, but this behavior must not be completely random. When you develop a crawling pattern, the best practice is to think about how a normal user would browse the website and then apply those principles to the tool itself.
For instance, you can visit the home page first and only then make requests to the inner pages, as in the sketch below.
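A minimal sketch of that idea, with a hypothetical list of inner-page paths on a placeholder site: the scraper visits the home page first, then crawls the inner pages in a shuffled order with irregular pauses.
import random
import time
import requests

BASE_URL = "https://www.example.com"  # placeholder target
INNER_PATHS = ["/blog/", "/products/", "/about/", "/contact/"]  # hypothetical paths

session = requests.Session()

# Start from the home page, the way a normal visitor would.
session.get(BASE_URL)
time.sleep(random.uniform(2, 5))

# Then visit the inner pages in a different order on every run.
random.shuffle(INNER_PATHS)
for path in INNER_PATHS:
    response = session.get(BASE_URL + path)
    print(path, response.status_code)
    time.sleep(random.uniform(2, 8))  # irregular pauses between page views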
- Reduce The Scraping Speed:
Slowing down your scraper is essential to mitigate the risk of being blocked. For example, you can add random breaks between requests or initiate wait commands before performing a particular action, as in the sketch below.
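As a minimal sketch, a random pause before each request keeps the rate modest; the 2-10 second range here simply mirrors the advice given later in this article, and the URLs are placeholders.
import random
import time
import requests

urls = ["https://www.example.com/page1", "https://www.example.com/page2"]  # placeholders

for url in urls:
    # Wait a random 2-10 seconds before every request.
    time.sleep(random.uniform(2, 10))
    response = requests.get(url)
    print(url, response.status_code)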
- Crawl During Off-peak Hours:
Crawlers move through pages faster than normal users because they don't have to read the content, so an unrestrained web crawling tool affects server load more than an average internet user does. Crawling during high-load times can hurt the user experience through service slowdowns. The best time to crawl a website depends on the individual case, but off-peak hours, such as the hours just after midnight for the site's main audience, are usually a good choice.
- Avoid Image Scraping:
Images are data-heavy objects that are often copyright-protected, so scraping them consumes extra storage space and bandwidth and carries a higher risk of infringing on someone else's rights.
Being data-heavy, images are also often hidden behind JavaScript elements (lazy loading, for example), which increases the complexity of the data acquisition process and can slow down the web scraper itself. Getting pictures out of JS elements requires writing and employing a more complicated scraping procedure.
- Avoid JavaScript:
Data nested in JavaScript elements is difficult to acquire. Websites use many JavaScript features to display content based on specific user actions; a common practice, for example, is to display product images in a search bar only once the user has provided some input. JavaScript also causes other problems, such as memory leaks, application instability, and occasional crashes, and dynamic features can become a burden. Avoid JavaScript unless it is absolutely necessary.
- Use A Headless Browser:
A headless browser can be used for block-free web scraping. It works like any other browser except that it has no GUI (graphical user interface), and it lets you scrape content that is loaded by rendering JavaScript elements. Headless modes are generally available in web browsers like Firefox and Chrome, as in the sketch below.
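Here is a minimal sketch using Selenium with headless Chrome, assuming the selenium package is installed and a compatible Chrome browser is available (recent Selenium releases can fetch the matching driver automatically). The URL is a placeholder.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without a visible window
options.add_argument("--window-size=1920,1080")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.example.com/")  # placeholder URL
    html = driver.page_source               # HTML after JavaScript has rendered
    print(len(html), "characters of rendered HTML")
finally:
    driver.quit()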
- Set Other Request Headers:
Real web browsers send a whole host of headers, and careful websites can check any of them to block your web scraper. To make your scraper appear to be a real browser, go to https://httpbin.org/anything and copy the headers you see there; those are the headers your current web browser is using.
Headers example from HTTPbin:
Headers such as "Accept," "Accept-Encoding," "Accept-Language," and "Upgrade-Insecure-Requests," set so that your requests look as though they come from a real browser, will help keep your web scraping from getting blocked. If you rotate through a series of IP addresses and set proper HTTP request headers, you can avoid being identified by most websites, as in the sketch below.
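A minimal sketch with the requests library follows; the header values are illustrative, so copy the real ones that your own browser reports at https://httpbin.org/anything.
import requests

# Illustrative browser-like headers; replace the values with the ones
# your own browser reports at https://httpbin.org/anything.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
    "Upgrade-Insecure-Requests": "1",
}

response = requests.get("https://httpbin.org/anything", headers=headers)
print(response.json()["headers"])  # echoes back the headers the server saw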
- Set Random Intervals In Between Your Requests:
A web scraper that sends exactly one request every second is simple to detect: no real person uses a website like that, and such a regular pattern is easily spotted. Use randomized delays, for example between 2 and 10 seconds, to build a web scraper that can avoid being blocked.
Be polite as well. Sending requests too quickly can crash the website, and if you notice responses slowing down, you should send requests more slowly so the web server is not overloaded.
Make sure you follow these web scraping best practices to avoid such problems.
Polite crawlers also check a site's robots.txt, located at https://prodigitalweb.com/robots.txt or https://www.prodigitalweb.com/robots.txt, for a line saying crawl-delay, which indicates the number of seconds you should wait between requests.
- Set A Referrer:
The Referer is an HTTP request header that tells the site which page you arrived from. You can use it to make it look as if you are arriving from Google, which is possible with the header:
"Referer": "https://www.google.com/"
You can also change it up for websites in different countries. For instance, if you want to scrape a site in the United Kingdom, use "https://www.google.co.uk/" instead of "https://www.google.com/."
- Detect Website Changes:
Websites change their layouts for many reasons, and such changes can break scrapers. Some sites have varying layouts in places you would not expect, and that is true even for big companies that are less tech-savvy, for example large retail stores transitioning to online sales. When you build the scraper, you have to identify these changes and set up ongoing monitoring so you know your crawler is still working. One simple approach is to count the number of successful requests in each crawl, as sketched below.
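A minimal sketch of that monitoring, assuming a hypothetical CSS selector that your parser depends on and placeholder URLs: it counts successful requests and flags pages where the expected element has disappeared.
import requests
from bs4 import BeautifulSoup

# Hypothetical values for illustration.
URLS = ["https://www.example.com/page1", "https://www.example.com/page2"]
EXPECTED_SELECTOR = "div.product-title"  # element your parser relies on

ok, layout_changed, failed = 0, 0, 0

for url in URLS:
    response = requests.get(url)
    if response.status_code != 200:
        failed += 1
        continue
    soup = BeautifulSoup(response.text, "html.parser")
    if soup.select_one(EXPECTED_SELECTOR) is None:
        layout_changed += 1  # page loaded, but the layout no longer matches
    else:
        ok += 1

print(f"successful: {ok}, layout changed: {layout_changed}, failed: {failed}")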
- Scrape Out Of The Google Cache:
For data that doesn't change very often, you can scrape Google's cached copy of a website instead of the website itself. You only have to prepend "http://webcache.googleusercontent.com/search?q=cache:" to the beginning of the URL. For instance, to scrape a site's documentation pages, you would send your request to:
"http://webcache.googleusercontent.com/search?q=cache:https://www.example.com/documentation/."
This is a good workaround for non-time-sensitive information on hard-to-scrape sites, and scraping Google's cache is more reliable than scraping a site that actively tries to block your scrapers. Always keep in mind, though, that it is not a foolproof solution.
A few websites (such as LinkedIn) tell Google not to cache their data, and the cached data for less popular sites can be outdated, because Google decides how often to crawl a website based on the site's popularity and the number of pages it has.
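A minimal sketch of that prepending step, with a placeholder target URL:
import requests

CACHE_PREFIX = "http://webcache.googleusercontent.com/search?q=cache:"
target = "https://www.example.com/documentation/"  # placeholder page

# Fetch Google's cached copy instead of the live page.
response = requests.get(CACHE_PREFIX + target)
print(response.status_code)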
- Use APIs To Your Advantage:
These days, much of the information displayed by websites comes from APIs, and scraping that data is hard because it is requested dynamically with JavaScript only after you have executed some action.
Suppose you want to collect data from posts appearing on a website with "infinite scroll." Static web scraping will not be very effective there, because you only get the results from the first page.
To configure the required user actions you can use headless browsers or a scraping service. Alternatively, you can reverse engineer the site's APIs:
- First, use the network inspector of the browser you prefer and check the XHR (XMLHttpRequest) requests that the page is making.
- Then check the parameters, such as page numbers, dates, or reference IDs. In a few cases the parameters use simple encodings intended to prevent the APIs from being used by third parties, so it may take some trial and error to work out exactly how to send the correct parameters.
- In other cases, you have to obtain authentication parameters with real users and browsers and then send that information to the server as cookies or headers, so carefully study the requests the website makes to its API.
- Remember that figuring out how a private API works can sometimes be difficult, but once you have, the parsing job is very simple, because you get organized and structured information, generally in JSON format (see the sketch after this list).
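As a minimal sketch under assumed names (the endpoint, paging parameters, headers, and JSON keys are hypothetical placeholders for whatever you discover in the network inspector), a reverse-engineered JSON API can often be paged through directly:
import requests

# Hypothetical endpoint and parameters discovered in the browser's network inspector.
API_URL = "https://www.example.com/api/posts"
headers = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://www.example.com/feed",
    # "Cookie": "session=...",  # add auth cookies/headers if the API requires them
}

all_posts = []
for page in range(1, 4):                     # fetch the first few pages
    params = {"page": page, "per_page": 20}  # hypothetical paging parameters
    response = requests.get(API_URL, params=params, headers=headers)
    response.raise_for_status()
    data = response.json()                   # structured JSON, easy to parse
    all_posts.extend(data.get("posts", []))

print(len(all_posts), "posts collected")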
- Stop Repeated Failed Attempts:
Nothing looks more suspicious to a webmaster than a stream of failed requests. They might not suspect a bot at first, but they will start investigating, and if they trace the errors back to a bot attempting to scrape their data, they will block the web scraper. That is why you should detect and log failed attempts, and get notified when they occur, so that you can suspend scraping. Such errors often happen because the website has changed; before you continue with data scraping, adjust the scraper to accommodate the new website structure. That way you avoid triggering alarms that may result in being blocked. A sketch of this kind of safeguard follows.
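A minimal sketch of that safeguard, using a simple consecutive-failure counter and placeholder URLs: after a few failures in a row, the scraper stops instead of hammering the site.
import requests

URLS = ["https://www.example.com/a", "https://www.example.com/b"]  # placeholders
MAX_CONSECUTIVE_FAILURES = 3

failures = 0
for url in URLS:
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        failures = 0  # reset the counter on success
        # ... parse the page here ...
    except requests.RequestException as exc:
        failures += 1
        print(f"Failed ({failures} in a row): {url} -> {exc}")
        if failures >= MAX_CONSECUTIVE_FAILURES:
            print("Too many consecutive failures; suspending scraping "
                  "until the scraper is adjusted to the site's new structure.")
            break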
The Bottom Line:
Websites use many mechanisms to detect scraping and prevent you from accessing their content, and a single technique is not enough to avoid getting blocked. For successful scraping you should set your browser parameters right, take care of fingerprinting, be conscious of honeypot traps, and use trustworthy proxies. With those in place, your public data-gathering jobs will run smoothly, and you can use fresh information to improve your business.
Frequently Asked Questions
Can you get IP banned for web scraping?
Yes. Web scraping has to balance efficiency with considerate behavior: overloading a server with a barrage of requests in a short period can result in an IP ban. Introduce a delay between requests to mimic human user behavior and stay friendly to the server.
Why can’t you scrape some websites?
Web scraping can be challenging in a few common scenarios. In particular, websites that implement CAPTCHAs or similar anti-bot measures can make scraping difficult.
Does Google ban scraping?
Google search results are generally considered publicly available data, so scraping them is allowed. However, a few types of data cannot be scraped, such as personal information and copyrighted content, so you should consult a legal professional beforehand.