How to Crawl Data from a Website
Thursday, February 23, 2017
The demand for crawling web data, or any other data source, has grown steadily over the past few years. Crawled data is used for evaluation or prediction in many different fields. For example, I have tried crawling Twitter, Facebook and other social media sites to extract user comments about universities, so that I could rank the universities based on people's evaluations and judgement rather than a dry official report. The first step was to fetch the data from the different websites. Then I could process the crawled data for deep learning, using sentiment analysis to assign different weights to reviews and comments according to their emotional tendency. That's interesting! Here, what I’d like to focus on is how to crawl data from a website. There are three methods you can adopt, and which one you choose depends entirely on your requirements.
There are several ways to crawl data from the web, and using APIs may be the first that comes to mind. Many large social media websites, such as Facebook, Twitter and StackOverflow, already provide APIs for users who need access to their data. You can use these official APIs to get structured data if you don’t want to build an engine or use a crawler tool. With the Facebook Graph API, as shown below, you choose the fields to query, order the data, do the URL lookup, make requests, and so on. To learn more, refer to https://developers.facebook.com/docs/graph-api/using-graph-api.
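As a rough illustration of the "choose the fields you query" step, here is a minimal sketch that only builds a Graph API request URL. The version string, the node ID `me`, the field names and the token are placeholder assumptions for illustration, not values from this article, and `build_graph_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed values for illustration only: the API version, node ID and token
# used below are placeholders.
GRAPH_BASE = "https://graph.facebook.com"


def build_graph_url(node_id, fields, access_token, version="v2.8"):
    """Build a Graph API GET URL that asks for specific fields of a node."""
    query = urlencode(
        {"fields": ",".join(fields), "access_token": access_token},
        safe=",",  # keep the comma-separated field list readable
    )
    return f"{GRAPH_BASE}/{version}/{node_id}?{query}"


url = build_graph_url("me", ["id", "name"], "YOUR_ACCESS_TOKEN")
print(url)
```

Sending a GET request to a URL like this, with a real access token, should return the requested fields as JSON; check the Graph API documentation for the fields and versions that are currently supported.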
However, not all websites provide APIs. Some websites decline to offer any public API for crawling or scraping large amounts of data, because of technical limits or for other reasons. Some people may propose RSS feeds instead, but because sites put limits on their use, I will not recommend them or comment on them further. In this case, we can build a crawler of our own.
Before going further, we need to understand what a crawler is and how it works. A crawler, put another way, is a method for generating a list of URLs that you then feed through your extractor. Crawlers can be defined as tools that find URLs: you first give the crawler a web page to start from, and it follows all the links on that page. This process then repeats in a loop for every newly found page.
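The loop described above can be sketched with the Python standard library alone. This is a minimal sketch, not a production crawler: the names `LinkParser`, `fetch` and `crawl`, the `max_pages` limit and the breadth-first order are all assumptions made for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and return its HTML as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl: return the list of URLs visited, in order."""
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to download
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

Calling `crawl("https://example.com", max_pages=10)` would return up to ten URLs that you can then pass to an extractor. A real crawler should also respect robots.txt and pause between requests, which this sketch leaves out.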
Now we can proceed with building our own crawler. Python is an open source programming language with many useful libraries. Here I suggest BeautifulSoup (a Python library), because it is easy to work with and has an intuitive interface. More precisely, I will use two Python modules to crawl the data. BeautifulSoup does not fetch the web page for us, which is why I combine it with urllib2 (a Python 2 module; in Python 3 its functionality lives in urllib.request). Next, we need to deal with the HTML tags to find all the links within the page’s <a> tags and locate the right table. Then we iterate through each row (tr), assign each cell of the row (td) to a variable and append it to a list. Let’s first look at the HTML structure of the table (I am not going to extract the table headings, <th>). With this approach your crawler is customized: it can handle difficulties that come up with API extraction, you can use a proxy to avoid being blocked by some websites, and so on. The whole process is within your control. This method should make sense for people with coding skills. The data frame you crawl looks like the figure below.
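A minimal sketch of these steps with BeautifulSoup might look like the following. The sample HTML, the `ranking` class name and the university names are invented for illustration; it assumes the `beautifulsoup4` package is installed, and in a real crawl the HTML would come from `urllib.request.urlopen(url).read()` (or `urllib2.urlopen` in Python 2) rather than from a string.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page standing in for a downloaded document.
SAMPLE_HTML = """
<html><body>
<a href="/universities">Universities</a>
<a href="/about">About</a>
<table class="ranking">
  <tr><th>Rank</th><th>University</th></tr>
  <tr><td>1</td><td>Alpha University</td></tr>
  <tr><td>2</td><td>Beta College</td></tr>
</table>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# All link targets inside <a> tags.
links = [a.get("href") for a in soup.find_all("a")]

# Locate the table of interest by its (assumed) class name.
table = soup.find("table", class_="ranking")

rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    if not cells:
        continue  # header rows contain only <th>, so skip them
    rows.append([td.get_text(strip=True) for td in cells])

print(links)
print(rows)
```

Each entry in `rows` is one table row as a list of cell strings, which can then be written to a CSV file or loaded into a data frame.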
However, crawling a website by writing your own program can be quite time-consuming. For people without any coding skills, it would be a hard task. Therefore, I’d like to introduce some crawler tools for people who want to crawl data from websites faster, or even get instant feedback.
Octoparse is a powerful, visual, Windows-based web data crawler. Its simple and friendly user interface makes the tool easy to pick up. To use it, you need to download the application to your local desktop. As the figure below shows, you can click and drag the blocks in the Workflow Designer pane to customize your own task. Octoparse offers two kinds of subscription plan, the Free Edition and the Paid Editions. Both can satisfy users' basic scraping or crawling needs. You can run your tasks locally and have the data exported in various formats. Beyond that, if you switch from the Free Edition to any Paid Edition, you can use the cloud-based service by uploading your task and configurations to the Cloud Platform, where 6 to 14 cloud servers run your tasks simultaneously, at higher speed and on a larger scale. Plus, you can automate your data extraction without leaving a trace by using Octoparse’s anonymous proxy feature, which rotates through a large number of IPs and helps prevent you from being blocked by certain websites. Octoparse also provides an API to connect your system to your scraped data in real time. You can either import the Octoparse data into your own database or use the Octoparse API to access your account’s data. After you finish configuring the task, you can export data in the format you need, such as CSV, Excel, HTML, TXT, or a database (MySQL, SQL Server, and Oracle).
Import.io is another web crawler that covers all levels of crawling needs. It offers a Magic tool that can convert a site into a table without any training sessions. It suggests that users download its desktop app if they need to crawl more complicated websites. Once you’ve built your API, it offers a number of simple integration options such as Google Sheets, Plot.ly and Excel, as well as GET and POST requests. Considering that all this comes with a free-for-life price tag and an awesome support team, import.io is a clear first port of call for those on the hunt for structured data. It also offers a paid enterprise-level option for companies looking for larger-scale or more complex data extraction.
Mozenda is also a user-friendly web data extractor. It has a point-and-click UI for users without any coding skills. Mozenda also takes the hassle out of automating and publishing extracted data. Tell Mozenda what data you want once, and then get it however frequently you need it. It also allows advanced programming through a REST API, with which users can connect directly to their Mozenda account. In addition, it provides a cloud-based service and IP rotation.
SEO experts, online marketers and even spammers should be very familiar with ScrapeBox and its very user-friendly UI. Users can easily harvest data from a website to grab emails, check page rank, verify working proxies and submit RSS feeds. By using thousands of rotating proxies, you will be able to look up a competitor’s site keywords, do research on .gov sites, harvest data, and post comments without getting blocked or detected.
Admittedly, those crawlers are powerful enough for people with complicated crawling or scraping needs. But if you just want to scrape data in a simple way, I suggest the Google Web Scraper Plugin. It is a browser-based web scraper that works like Firefox’s Outwit Hub. You can download it as an extension and install it in your browser. Highlight the data fields you’d like to crawl, right-click and choose “Scrape similar…”. Anything similar to what you highlighted will be rendered in a table ready for export, compatible with Google Docs. The latest version still had some bugs with spreadsheets. Even though it is easy to handle, note that it can’t scrape images or crawl data in large amounts.
Author: The Octoparse Team