Web Crawling | How to Build a Crawler to Extract Web Data
Wednesday, May 03, 2017
Semantic technology promises to kick-start a new era of a smarter, more structured web of data, and it is increasingly supported by companies such as Facebook, Google, and Best Buy. Web crawling, a core part of data analysis, has come a long way from an emerging technology to an integral part of many industries. The first crawlers were developed for a much smaller web (about 100,000 web pages), but today some popular sites alone have millions of pages. A wealth of information can now be crawled and transformed into valuable data sets for use in different industries: enterprises mine user data for business analysis, and scholars mine data for scientific research. So where and how can we fetch data, apart from the official public data sets that some companies provide? One answer is to build our own customized web crawlers to crawl websites.
What Is Web Crawling
-- Start with Web Crawling
Web crawling refers to extracting specific HTML data from certain websites. Simply put, a web crawler is a program designed to crawl websites in a targeted way and collect data. However, for a website containing many pages, we cannot know the URLs of all its pages in advance. The question, then, is how to fetch all the HTML pages of a website.
-- Traverse all the URLs
Normally, we define an entry page. A web page usually contains the URLs of other web pages, so we can retrieve those URLs from the current page and add them to the crawling queue. Next, we crawl another page from the queue and repeat the same process recursively. Essentially, the crawling scheme is a depth-first or breadth-first traversal. As long as we can access the Internet and parse the web page, we can crawl a website. Fortunately, most programming languages offer HTTP client libraries for fetching web pages, and regular expressions can even be used for simple HTML analysis.
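As a minimal sketch of retrieving the URLs from a page, the Python snippet below collects the href of every "a" tag and turns it into an absolute URL. It uses the standard library's html.parser instead of a regular expression, since a parser copes better with messy markup. The names LinkParser and extract_links, and the example.com URLs, are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return absolute http(s) URLs found in html, in page order, without duplicates."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        # Resolve relative links against the page URL and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(("http://", "https://")) and absolute not in links:
            links.append(absolute)
    return links


# Hypothetical page content, used only to show the call.
sample = '<h1><a href="/2008/09/11/some-post/">Post</a></h1><a href="/page/2/">Next</a>'
print(extract_links(sample, "http://example.com/"))
```

The returned list is what would be appended to the crawling queue.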
-- General Web Crawler Algorithm
- Start with a list of initial URLs, called the seeds.
- Visit these URLs.
- Retrieve required information from the page.
- Identify all the hyperlinks on the page.
- Add the links to the queue of URLs, called the crawl frontier.
- Recursively visit the URLs from the crawl frontier.
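The steps above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the function name crawl, its parameters, and the max_pages cap are invented here, the frontier is processed breadth-first with a loop rather than true recursion, and downloading and link extraction are passed in as functions so the sketch runs without network access.

```python
from collections import deque


def crawl(seeds, fetch, extract_links, max_pages=100):
    """Breadth-first crawl starting from the seed URLs.

    fetch(url) returns the page content, or None if the page is unavailable.
    extract_links(page, url) returns the URLs found on that page.
    """
    frontier = deque(seeds)   # the crawl frontier: URLs waiting to be visited
    seen = set(seeds)         # every URL ever queued, so nothing is queued twice
    pages = {}                # url -> retrieved content
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:
            continue
        pages[url] = page
        for link in extract_links(page, url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages


# A tiny in-memory "website": each page is represented by the list of its links.
SITE = {"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": ["a"]}
pages = crawl(["a"], fetch=SITE.get, extract_links=lambda page, url: page)
print(list(pages))  # ['a', 'b', 'c', 'd']
```

In real use, fetch would wrap an HTTP client call and extract_links would parse the HTML. Swapping popleft() for pop() turns the breadth-first order into a depth-first one.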
How to Crawl a Website
-- Two Major Steps to Build a Web Crawler
To build a web crawler, one essential step is downloading the web pages. This is not easy, since many factors must be taken into consideration, such as how to make better use of local bandwidth, how to optimize DNS queries, and how to reduce the load on the server by scheduling web requests sensibly.
Although there are many things to be aware of when building a web crawler, in most cases we just want to create a crawler for a specific website rather than a general one like Google’s crawler. We should therefore study the structure of the target website closely and pick out the valuable links to follow, to avoid the extra cost of redundant or junk URLs. What’s more, if we can work out a suitable crawling path from the site structure, we can crawl only the content we are interested in by following a predefined sequence.
For example, suppose we’d like to crawl the content of mindhack.cn and have found two types of pages that we’re interested in:
1. Article List, such as the main page, or URLs matching /page/\d+/ and so on.
By inspecting the page with Firebug, we find that the link to each article is an “a” tag under an h1.
2. Article Content, such as /2008/09/11/machine-learning-and-ai-resources/, which contains the complete article content.
Thus, we can start with the main page and retrieve the other list pages from the page navigation on the entry page (wp-pagenavi). Specifically, we define a path: we follow only the “next page” link, which means we traverse all the list pages from start to end and never need to check whether a page has already been visited. The article links within each list page are then the URLs we’d like to store.
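To make the two page types concrete, here is a small Python sketch that sorts URLs into list pages and article pages with regular expressions. The list pattern comes from /page/\d+/ above. The article pattern (year/month/day/slug) is an assumption generalised from the single example URL, and the name classify and the http:// scheme are illustrative only.

```python
import re
from urllib.parse import urlparse

# The main page, or a paginated list such as /page/3/
LIST_PAGE = re.compile(r"^/(page/\d+/)?$")
# Assumed article layout: /YYYY/MM/DD/slug/
ARTICLE_PAGE = re.compile(r"^/\d{4}/\d{2}/\d{2}/[^/]+/$")


def classify(url):
    """Return 'list', 'article' or 'other' based on the URL path."""
    path = urlparse(url).path or "/"
    if LIST_PAGE.match(path):
        return "list"
    if ARTICLE_PAGE.match(path):
        return "article"
    return "other"


for url in (
    "http://mindhack.cn/",
    "http://mindhack.cn/page/2/",
    "http://mindhack.cn/2008/09/11/machine-learning-and-ai-resources/",
):
    print(url, "->", classify(url))
```

A crawler following the path described above would keep walking the "list" URLs and store the "article" URLs.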
-- Some Tips for Crawling
- Crawl Depth - How many clicks from the entry page you want the crawler to traverse.
In most cases, a depth of 5 is enough for crawling most websites.
- Distributed Crawling - The crawler will attempt to crawl multiple pages at the same time.
- Pause - The length of time the crawler pauses before crawling the next page.
The faster you set the crawler, the harder it will be on the server. Leave at least 5-10 seconds between page requests.
- URL template - The template determines which pages the crawler extracts data from.
- Save log - A saved log stores which URLs were visited and which were converted into data.
It is used for debugging and to prevent crawling a visited page repeatedly.
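Several of these tips can be combined in one loop. The Python sketch below is a rough illustration, not a production crawler: it applies a depth limit, a pause between requests, a URL template given as a regular expression, and an in-memory log of visited URLs. The name polite_crawl and all parameter names are invented for this example, and distributed crawling is left out.

```python
import re
import time
from collections import deque


def polite_crawl(seed, fetch, extract_links, max_depth=5,
                 pause_seconds=5.0, url_template=r".*"):
    """Crawl from seed, returning (pages, log).

    pages maps each URL matching url_template to its content.
    log lists (url, converted) in visit order; converted is True
    when the page was stored as data.
    """
    template = re.compile(url_template)
    frontier = deque([(seed, 0)])   # (url, clicks from the entry page)
    seen = {seed}
    pages = {}
    log = []
    while frontier:
        url, depth = frontier.popleft()
        if log:
            time.sleep(pause_seconds)   # pause before every request but the first
        page = fetch(url)
        converted = page is not None and template.search(url) is not None
        if converted:
            pages[url] = page
        log.append((url, converted))
        # Only follow links while still within the depth limit.
        if page is not None and depth < max_depth:
            for link in extract_links(page, url):
                if link not in seen:
                    seen.add(link)
                    frontier.append((link, depth + 1))
    return pages, log
```

To keep the log between runs, it could be written to a file after the crawl and consulted before starting the next one.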
Finding a Tool to Crawl Data
Web crawling today comes with significant tactical challenges:
- IP address blocking by target websites
- Non-uniform or irregular web structures
- AJAX-loaded content
- Real-time latency
- Websites with aggressive anti-crawling measures
Tackling all of these issues is not an easy task. It can be troublesome and time-consuming. Fortunately, you no longer need to crawl a website the old way and get stuck on technical problems, since there is now an alternative method for crawling data from target websites. Users are not required to deal with complex configuration or to code a crawler by themselves. Instead, they can concentrate on data analysis in their respective business domains.
The method I have in mind is an automated web crawler, Octoparse, which makes crawling available to anyone. Users can crawl data with its built-in tools and APIs through a user-friendly point-and-click UI. The application also offers many extensions that handle the issues above in a few configuration steps, so problems can be dealt with much more efficiently using its powerful utilities, including:
- IP proxy servers to prevent IP blocking
- Built-in Regex Tool to re-format data fields
- AJAX setting to load dynamic content
- Cloud service to split tasks and speed up extraction, and more
To learn more about this web crawler software, check out the video below on how to get started with Octoparse and begin crawling.