Best 3 Ways to Crawl Data from a Website
Friday, May 29, 2020
The demand for crawled web data has grown over the past few years. Crawled data can be used for evaluation or prediction in many fields. Here, I’d like to talk about 3 methods we can adopt to crawl data from a website.
1. Use Website APIs
Many large social media websites, such as Facebook, Twitter, Instagram, and StackOverflow, provide APIs for users to access their data. When an official API is available, you can use it to get structured data. With the Facebook Graph API, for example, you choose the fields for your query, order the data, do the URL lookup, make requests, and so on, as shown below. To learn more, you can refer to https://developers.facebook.com/docs/graph-api/using-graph-api.
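To make the API route more concrete, here is a minimal sketch of how a Graph API request URL could be put together in Python. The helper names (build_graph_url, fetch_json), the version string "v7.0", the node "me", and the token value are all assumptions chosen for illustration. Check the documentation linked above for the fields and versions that apply to your app.

```python
import json
import urllib.request
from urllib.parse import urlencode

GRAPH_BASE = "https://graph.facebook.com"


def build_graph_url(node_id, fields, access_token, version="v7.0"):
    """Build a Graph API URL asking for specific fields of one node.

    The version default is an assumption; use whichever version your app targets.
    """
    # Keep commas unescaped so the fields list stays readable in the URL.
    query = urlencode(
        {"fields": ",".join(fields), "access_token": access_token},
        safe=",",
    )
    return f"{GRAPH_BASE}/{version}/{node_id}?{query}"


def fetch_json(url):
    """Send a GET request and decode the JSON body (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


# "YOUR_TOKEN" is a placeholder, not a real access token.
url = build_graph_url("me", ["id", "name"], "YOUR_TOKEN")
print(url)
```

Calling fetch_json(url) with a valid token would then return the structured data as a Python dictionary.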
2. Build your own crawler
However, not all websites provide APIs. Some websites refuse to offer any public API because of technical limitations or for other reasons. Some people may suggest RSS feeds instead, but because sites put limits on their use, I will not recommend them or comment on them further. In this case, we can build our own crawler.
How does a crawler work? Put another way, a crawler is a method for generating a list of URLs that you can feed through your extractor. Crawlers can be defined as tools for finding URLs. You first give the crawler a webpage to start from, and it follows all the links on that page. This process then repeats in a loop for each newly found page.
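The loop described above can be sketched with nothing but the Python standard library. This is an illustration under my own assumptions, not a production crawler: the names LinkParser, crawl, FAKE_SITE, and fake_fetch are made up for the example, and the "website" is a small in-memory dictionary so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=50):
    """Visit pages breadth-first from start_url and return the URLs visited.

    fetch is any function that takes a URL and returns that page's HTML.
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:       # never queue a URL twice
                seen.add(absolute)
                queue.append(absolute)
    return visited


# A tiny made-up site standing in for real HTTP responses.
FAKE_SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/c">C</a>',
    "http://example.test/c": "",
}


def fake_fetch(url):
    return FAKE_SITE.get(url, "")


print(crawl("http://example.test/", fake_fetch))
```

Passing the fetch function in as an argument keeps the loop separate from the download step, so a real HTTP fetcher can be swapped in later. The seen set stops the crawler from going around in circles when pages link back to each other.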
Then, we can proceed with building our own crawler. Python is an open-source programming language with many useful libraries. Here, I suggest BeautifulSoup (a Python library) because it is easy to work with and has an intuitive interface. More precisely, I will use two Python modules to crawl the data.
BeautifulSoup does not fetch the web page for us. That’s why I use urllib2 (urllib.request in Python 3) together with the BeautifulSoup library. Then, we need to work through the HTML tags to find all the links within the page’s <a> tags and to locate the right table. After that, iterate through each row (tr), assign each cell of the row (td) to a variable, and append it to a list. Let’s first look at the HTML structure of the table (I am not going to extract information from the table headings <th>).
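As a rough sketch of the parsing steps just described, the snippet below runs BeautifulSoup on a small made-up HTML string rather than a downloaded page. The sample markup, the table class "data", and the function name extract_links_and_rows are assumptions for illustration; a real page will need its own selectors. It requires the beautifulsoup4 package.

```python
from bs4 import BeautifulSoup

# Made-up page content; in practice this would come from urllib.request.
SAMPLE_HTML = """
<html><body>
<a href="/page/1">One</a>
<a href="/page/2">Two</a>
<table class="data">
<tr><th>Name</th><th>Price</th></tr>
<tr><td>Apple</td><td>1.20</td></tr>
<tr><td>Pear</td><td>0.80</td></tr>
</table>
</body></html>
"""


def extract_links_and_rows(html, table_class="data"):
    """Return all link targets and the data rows of one table."""
    soup = BeautifulSoup(html, "html.parser")

    # Every <a> tag that actually carries an href attribute.
    links = [a["href"] for a in soup.find_all("a", href=True)]

    # Locate the right table, then walk its rows.
    table = soup.find("table", class_=table_class)
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if not cells:
            continue  # heading rows contain only <th>, so skip them
        rows.append([td.get_text(strip=True) for td in cells])
    return links, rows


links, rows = extract_links_and_rows(SAMPLE_HTML)
print(links)
print(rows)
```

To run it against a live page, the HTML string could be read with urllib.request.urlopen(url).read() and passed to the same function.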
By taking this approach, your crawler is customized. It can handle certain difficulties encountered with API extraction. You can use a proxy to keep it from being blocked by some websites, and so on. The whole process is within your control. This method makes sense for people with coding skills. The data frame you crawled should look like the figure below.
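As one example of that control, requests can be routed through a proxy with the standard library's urllib.request. This is only a sketch: the function name build_proxy_opener, the proxy address, and the User-Agent string are placeholders I picked for illustration, and you would substitute a proxy you are actually allowed to use.

```python
import urllib.request


def build_proxy_opener(proxy_url, user_agent="my-crawler/0.1"):
    """Create an opener that sends HTTP and HTTPS requests through one proxy."""
    handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url}
    )
    opener = urllib.request.build_opener(handler)
    # Identify the crawler on every request made through this opener.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


# Placeholder address; nothing is contacted until opener.open() is called.
opener = build_proxy_opener("http://127.0.0.1:8080")
# html = opener.open("http://example.com/").read()  # needs a working proxy
```

Rotating proxies then comes down to building one opener per proxy address and switching between them from request to request.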
3. Take advantage of ready-to-use crawler tools
However, crawling a website by writing your own program may be time-consuming. For people without any coding skills, it would be a hard task. Therefore, I'd like to introduce some crawler tools.
Octoparse is a powerful, visual, Windows-based web data crawler. Its simple and friendly user interface makes the tool easy to pick up. To use it, you need to download the application to your local desktop.
As the figure below shows, you can click and drag the blocks in the Workflow Designer pane to customize your own task. Octoparse provides two crawling service subscription plans - the Free Edition and the Paid Edition. Both can satisfy users' basic scraping or crawling needs. With the Free Edition, you can run your tasks locally.
If you switch from the Free Edition to a Paid Edition, you can use the Cloud-based service by uploading your tasks to the Cloud Platform. 6 to 14 cloud servers will run your tasks simultaneously, at a higher speed and on a larger scale. Plus, you can automate your data extraction without leaving a trace by using Octoparse’s anonymous proxy feature, which rotates through a large number of IPs to prevent you from being blocked by certain websites. Here's a video introducing Octoparse Cloud Extraction.
Octoparse also provides an API to connect your system to your scraped data in real time. You can either import the Octoparse data into your own database or use the API to request access to your account’s data. After you finish configuring the task, you can export the data in various formats, such as CSV, Excel, HTML, TXT, and databases (MySQL, SQL Server, and Oracle).
Import.io is another web crawler that covers all levels of crawling needs. It offers a Magic tool that can convert a site into a table without any training sessions. It recommends that users download its desktop app if they need to crawl more complicated websites. Once you’ve built your API, it offers a number of simple integration options such as Google Sheets, Plot.ly, and Excel, as well as GET and POST requests. When you consider that all this comes with a free-for-life price tag and an awesome support team, import.io is a clear first port of call for those on the hunt for structured data. It also offers a paid enterprise-level option for companies looking for larger-scale or more complex data extraction.
Mozenda is another user-friendly web data extractor. It has a point-and-click UI for users without any coding skills. Mozenda also takes the hassle out of automating and publishing extracted data. Tell Mozenda what data you want once, and then get it however frequently you need it. Plus, it supports advanced programming through a REST API that lets users connect directly to their Mozenda account. It also provides a Cloud-based service and IP rotation.
SEO experts, online marketers, and even spammers should be very familiar with ScrapeBox and its very user-friendly UI. Users can easily harvest data from a website to grab emails, check page rank, verify working proxies, and submit RSS feeds. By using thousands of rotating proxies, you can snoop on a competitor’s site keywords, do research on .gov sites, harvest data, and post comments without getting blocked or detected.
If you just want to scrape data in a simple way, I suggest the Google Web Scraper Plugin. It is a browser-based web scraper that works like Firefox's Outwit Hub. You can download it as an extension and install it in your browser. Highlight the data fields you’d like to crawl, right-click, and choose “Scrape similar…”. Anything similar to what you highlighted will be rendered in a table ready for export, compatible with Google Docs. The latest version still has some bugs with spreadsheets. Although it is easy to use, note that it can’t scrape images or crawl large amounts of data.
Author: The Octoparse Team