How to Scrape Websites Without Being Blocked?
Friday, July 05, 2019
Web scraping is a technique for automating human browsing behavior in order to retrieve large amounts of data from web pages efficiently.
While web scraping tools such as Octoparse are growing in popularity and benefit people in many fields, they come at a cost to website owners. A straightforward example is a scraper that overloads a web server and brings it down. More and more site owners have equipped their sites with anti-scraping techniques to block scrapers, which makes web scraping more difficult. Nevertheless, there are still ways to avoid being blocked. In this article, we will talk about 5 tips you can follow to get around blocking.
1. Slow down the scraping
Most web scraping activities aim to fetch data as quickly as possible. However, a human browses a site far more slowly than a scraper does. It is therefore easy for a site to identify you as a scraper by tracking your access speed. Once it finds you are going through the pages too fast, it will suspect that you are not a human and block you.
So please do not overload the site. Put a random delay between requests and limit concurrent page access to 1-2 pages at a time. Treat the website nicely, and you will be able to keep scraping it.
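As a rough illustration of random delays between requests, here is a minimal Python sketch. The function names (polite_delay, crawl_slowly), the 2 to 6 second range, and the fetch callback are assumptions made for this example; they are not part of Octoparse or of any particular library.

```python
import random
import time
from typing import Callable, Iterable, List


def polite_delay(min_seconds: float = 2.0, max_seconds: float = 6.0) -> float:
    """Pick a random wait time (in seconds) within the given range."""
    return random.uniform(min_seconds, max_seconds)


def crawl_slowly(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    min_seconds: float = 2.0,
    max_seconds: float = 6.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch pages one at a time, pausing a random interval between requests.

    `fetch` is a hypothetical callback that downloads a single page; pages
    are requested sequentially, so there is no concurrent access at all.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(polite_delay(min_seconds, max_seconds))
        results.append(fetch(url))
    return results
```

Passing the download function in as fetch keeps the pacing logic independent of the HTTP client you use. The right delay range depends on the site, so treat the defaults above as placeholders.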
In Octoparse, users can set up a wait time for any step in the workflow to control the scraping speed. There is even a “random” option to make the scraping more human-like.
2. Use proxy servers
When a site detects a large number of requests from a single IP address, it can easily block that address. To avoid sending all of your requests through the same IP address, you can use proxy servers. A proxy server is a server (a computer system or an application) that acts as an intermediary for requests from clients seeking resources from other servers (from Wikipedia: Proxy server). It allows you to send requests to websites using the IP you set up, masking your real IP address.
Of course, if you route everything through a single proxy IP, it is still easy to get blocked. You need to create a pool of IP addresses and pick from them at random so that your requests reach the site from a series of different IP addresses.
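The idea of a proxy pool can be sketched with Python's standard library alone. The proxy addresses below are placeholders from a reserved documentation range, not working proxies, and the helper names (pick_proxy, opener_for, fetch_via_random_proxy) are made up for this example.

```python
import random
import urllib.request
from typing import Sequence

# Placeholder addresses only; replace them with proxies you are allowed to use.
PROXY_POOL = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]


def pick_proxy(pool: Sequence[str]) -> str:
    """Choose one proxy at random from the pool."""
    return random.choice(pool)


def opener_for(proxy_url: str) -> urllib.request.OpenerDirector:
    """Build an opener that sends HTTP and HTTPS traffic through one proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_via_random_proxy(
    url: str, pool: Sequence[str] = PROXY_POOL, timeout: float = 10.0
) -> bytes:
    """Download a page through a randomly selected proxy from the pool."""
    proxy = pick_proxy(pool)
    with opener_for(proxy).open(url, timeout=timeout) as response:
        return response.read()
```

Choosing a proxy per request, rather than per session, spreads the traffic most evenly. A real setup would probably also drop proxies that fail or get blocked, which this sketch leaves out.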
Many services, such as VPNs, can provide rotating IP addresses. Octoparse Cloud Service is supported by hundreds of cloud servers, each with a unique IP address. When an extraction task is set to execute in the Cloud, requests reach the target website through various IPs, minimizing the chances of being traced. For local extraction, Octoparse allows users to set up proxies to avoid being blocked.
3. Apply different scraping patterns
Humans browse a site with random clicks and varying view times, whereas a scraper follows the same crawling pattern every time because programmed bots follow fixed logic. Anti-scraping mechanisms can therefore detect a crawler by identifying repetitive behavior on a website.
You will need to change your scraping pattern from time to time and incorporate random clicks, mouse movements, or waiting times to make the scraping look more human.
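One simple way to vary the pattern in code is to randomize both the order in which pages are visited and the time spent on each. The sketch below only builds such a plan; the function name randomized_plan and the 1 to 8 second view-time range are assumptions for illustration, and simulating clicks or mouse movements would need a browser automation tool that is not shown here.

```python
import random
from typing import Iterable, List, Optional, Tuple


def randomized_plan(
    urls: Iterable[str],
    min_view: float = 1.0,
    max_view: float = 8.0,
    rng: Optional[random.Random] = None,
) -> List[Tuple[str, float]]:
    """Return (url, view_seconds) pairs in a shuffled order.

    Each run produces a different visiting order and different view times,
    so successive crawls do not repeat exactly the same pattern.
    """
    rng = rng or random.Random()
    order = list(urls)
    rng.shuffle(order)
    return [(url, rng.uniform(min_view, max_view)) for url in order]
```

A crawler could walk through the returned plan, fetching each URL and then waiting for the paired number of seconds before moving on.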
In Octoparse, you can easily set up a workflow in 3-5 minutes. You can add clicks and mouse movements by pointing and dragging, or even rebuild a workflow quickly, which saves programmers a lot of coding time and helps non-coders build their own scrapers.
4. Switch user-agents
A user-agent (UA) is a string in the header of a request that identifies the browser and operating system to the web server. Every request made by a web browser contains a user-agent. Using the same user-agent for an abnormally large number of requests will get you blocked.
To avoid the block, switch user-agents frequently instead of sticking to one.
Many programmers add a fake user-agent to the header or manually maintain a list of user-agents to avoid being blocked. With Octoparse, you can easily enable automatic UA rotation in your crawler to reduce the risk of being blocked.
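For those who do maintain the list by hand, rotation can be as small as the following sketch, which uses only Python's standard library. The user-agent strings are illustrative examples, and the helper name build_request is made up for this example; keep your own list current.

```python
import random
import urllib.request
from typing import Sequence

# Example user-agent strings for illustration; maintain your own current list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/12.1.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:67.0) Gecko/20100101 Firefox/67.0",
]


def build_request(
    url: str, user_agents: Sequence[str] = USER_AGENTS
) -> urllib.request.Request:
    """Create a request whose User-Agent header is picked at random."""
    headers = {"User-Agent": random.choice(user_agents)}
    return urllib.request.Request(url, headers=headers)
```

The returned object can be passed to urllib.request.urlopen, so each call to build_request may present a different user-agent to the server.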
5. Be careful of honeypot traps
Honeypots are links that are invisible to normal visitors but are present in the HTML code and can be found by web scrapers. They act as traps that detect scrapers by directing them to blank pages. Once a particular visitor requests a honeypot page, the website can be relatively sure it is not a human visitor and starts throttling or blocking all requests from that client.
When building a scraper for a particular site, it is worth checking carefully whether there are any links hidden from users of a standard browser.
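Part of that check can be automated. The sketch below, with the hypothetical names LinkSorter and split_links, separates links that are hidden through an inline style or the hidden attribute from the rest. It is only a partial heuristic, assuming the trap is hidden inline: links hidden by an external stylesheet, by a hidden parent element, or by JavaScript will not be caught.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class LinkSorter(HTMLParser):
    """Collect <a href> links, separating visibly hidden ones from the rest."""

    def __init__(self) -> None:
        super().__init__()
        self.visible_links: List[str] = []
        self.suspicious_links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        # Normalize the inline style so "display: none" and "display:none" match.
        style = (attributes.get("style") or "").replace(" ", "").lower()
        hidden = (
            "hidden" in attributes
            or "display:none" in style
            or "visibility:hidden" in style
        )
        if hidden:
            self.suspicious_links.append(href)
        else:
            self.visible_links.append(href)


def split_links(html: str) -> Tuple[List[str], List[str]]:
    """Return (visible_links, suspicious_links) found in the HTML."""
    parser = LinkSorter()
    parser.feed(html)
    return parser.visible_links, parser.suspicious_links
```

A scraper would then follow only the first list and leave the suspicious links alone.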
Octoparse uses XPath for precise capturing or clicking actions, which helps it avoid clicking fake links (see how to use XPath to locate elements here).
All the tips provided in this article can help you avoid getting blocked to some extent. Still, as the saying goes, when web scraping technology climbs a foot, anti-scraping technology climbs ten. Share your ideas with us if you feel anything can be added to the list.