Web Crawling | How to Build a Crawler to Extract Web Data
Wednesday, May 03, 2017
Start with Web Crawling
- What is Web Crawling?
Web crawling refers to extracting specific HTML data from websites. Simply put, a web crawler is a program that visits websites in a defined order and gleans data from them. However, for a website with many pages, we cannot know the URLs of all its pages in advance. Thus, the question is how to fetch all the HTML pages of a website.
- Traverse All the URLs
Normally, we can define an entry page: one web page contains the URLs of other pages, so we retrieve those URLs from the current page and add them to the crawling queue. Then we crawl the next page and repeat the same process recursively. Essentially, the crawling scheme is a depth-first or breadth-first traversal, and as long as we can access the Internet and parse web pages, we can crawl a website. Fortunately, most programming languages offer HTTP client libraries for downloading pages, and we can even use regular expressions for simple HTML analysis.
- General Web Crawler Algorithm
- Start with a list of initial URLs, called the seeds.
- Visit these URLs.
- Retrieve required information from the page.
- Identify all the hyperlinks on the page.
- Add the links to the queue of URLs, called the crawler frontier.
- Recursively visit the URLs from the crawler frontier.
How to Crawl a Website?
- Two Major Steps to Build a Web Crawler
To build a web crawler, one must-do step is to download the web pages. This is not easy, since many factors need to be taken into consideration, such as how to make better use of local bandwidth, how to optimize DNS queries, and how to reduce the load on the server by scheduling web requests reasonably.
While there are many things to be aware of when building a general-purpose crawler, in most cases we just want a crawler for a specific website. So we should study the structure of the target website and pick out the valuable links to follow, in order to avoid the extra cost of redundant or junk URLs. And if we can find a crawling path that matches the site structure, we can crawl only the content we are interested in, in a predefined order.
For example, suppose we'd like to crawl the content of mindhack.cn, and we have found two types of pages that we're interested in:
1. Article lists, such as the main page or URLs matching /page/\d+/.
By inspecting the page with Firebug, we can see that the link to each article is an `a` tag under an `h1` element.
2. Article content, such as /2008/09/11/machine-learning-and-ai-resources/, which contains the complete article.
Thus, we can start from the main page and retrieve the other list pages from its pagination element, wp-pagenavi. Specifically, we define a path: we only follow the "next page" link, which lets us traverse all the list pages from start to end and frees us from repeated visited-or-not checks. The article links found within each list page are then the URLs we want to store.
- Some Tips for Crawling
- Crawl depth - the number of clicks from the entry page that the crawler will traverse. In most cases, a depth of 5 is enough to cover most websites.
- Distributed crawling - running several crawler processes so that many pages are crawled at the same time.
- Pause - the length of time the crawler pauses before crawling the next page.
- The faster you set the crawler, the harder it will be on the server (leave at least 5-10 seconds between page requests).
- URL template - a pattern that determines which pages the crawler extracts data from.
- Save log - a saved log records which URLs were visited and which were converted into data. It is useful for debugging and prevents re-crawling pages that have already been visited.
Finding a Tool to Crawl Data
Web crawling today comes with significant technical challenges:
- IP address blocking by target websites
- Non-uniform or irregular web structures
- AJAX loaded content
- Real-time latency
- Aggressive anti-crawling mechanisms
Tackling all of these issues is not easy and can be downright troublesome. Fortunately, you no longer need to crawl a website the old way and get stuck on technical problems. There is an alternative: users can crawl data from target websites without dealing with complex configuration or writing code to build a crawler themselves, and instead concentrate on data analysis in their own business domains.
The method I'd mention is an automated web crawler, Octoparse, which makes crawling available to everyone. Users can crawl data with its built-in tools and APIs through a user-friendly point-and-click UI. The application also offers many extensions that handle the issues above within a few configuration steps. Its powerful functions include:
- IP proxy servers to prevent IP blocking
- Built-in Regex Tool to re-format data fields
- AJAX settings to load dynamic content
- Cloud service to split the task and speed up extraction
To learn more about this web crawling software, you can watch the getting-started video for Octoparse and begin your crawling.
Article in Spanish: Web Crawling | Cómo construir un crawler para Extraer Web Datos
You can also read web scraping articles on the official website.
Author: The Octoparse Team