Most enterprises of any size generate large amounts of web data all the time, but how to deal with it (collecting and processing the data) is always a problem. The significance of big data technologies lies not in the ability to amass large-scale data, but in the intelligence to process it and extract valuable information from such a large volume for further analysis. And the premise of big data technologies is that we can obtain a large volume of valuable data in the first place.
Data analysis and data mining are focused not on the data itself, but on how to solve actual business problems with it. We can extract valuable information from collected data through data mining and data analysis, but only if we ensure that the data collected are of high quality.
How do you get data from the web? Or, more specifically, the exact data you want from a webpage?
As a big fan of data mining, I'd like to share with you my experience of getting data from the web.
1. Web crawler
A web crawler (also known as a web scraper, web spider, or web robot) is an automated program or script used to browse the internet and collect data from web pages.
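To make the idea concrete, here is a minimal sketch of the "collect data from web pages" part of a crawler, using only Python's standard library. The sample HTML is hypothetical; in a real crawler it would come from an HTTP request rather than a hardcoded string.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects every href found in <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# In a real crawler this HTML would be fetched, e.g. with
# urllib.request.urlopen(url).read().decode()
sample_page = """
<html><body>
  <a href="/hotels">Hotels</a>
  <a href="/travel">Travel</a>
</body></html>
"""

collector = LinkCollector()
collector.feed(sample_page)
print(collector.links)  # the URLs the crawler would visit next
```

A full crawler simply repeats this loop: fetch a page, extract the data and links you care about, then queue the new links for the next round.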
The most common way to retrieve web data is to use a crawler. Once you know how to write one in a programming language such as R or Python, you can crawl almost any data you see on a web page, and an automated crawler can conveniently collect large quantities of data from websites in a short time. For example, I have collected 100,000 social media records, 2 million lottery records, 1 million travel records, 150,000 hotel records, and 400,000 URLs from a website. After collecting the data, I use regular expressions (a simple but powerful tool, and my favorite) to extract the exact strings I need. If you are new to regular expressions, there are many free online resources and regular expression testers to learn from; I use regex101 a lot. The more you practice with regular expressions, the more familiar you become with them. Practice makes perfect.
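As an illustration of the extraction step, here is a short sketch of pulling exact strings out of crawled HTML with Python's `re` module. The hotel-listing snippet, class names, and prices are all hypothetical stand-ins for whatever markup your crawler actually returns.

```python
import re

# A hypothetical snippet of crawled hotel-listing HTML.
html = '''
<div class="hotel"><span class="name">Sea View Inn</span><span class="price">$120</span></div>
<div class="hotel"><span class="name">City Lodge</span><span class="price">$95</span></div>
'''

# One pattern with two capture groups pulls out each
# hotel name together with its price.
pattern = r'<span class="name">(.*?)</span><span class="price">\$(\d+)</span>'
hotels = re.findall(pattern, html)
print(hotels)  # [('Sea View Inn', '120'), ('City Lodge', '95')]
```

Pasting the same pattern and sample text into an online tester such as regex101 is a quick way to see exactly which parts of the page each group matches before you run it over a large crawl.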
What if you don’t have any coding knowledge but want to get data from the web?
Fear not! A lot of web scraping tools for non-technical people are available online.
If you’re a little curious about web crawlers and have a love of learning, I suggest you try web scraping software such as Octoparse, import.io, or WebHarvy. It may take some time to learn, but once mastered, it’s hard to find anything better.
If you’re not technical and have no time to learn web scraping software, consider outsourcing or other ways to get the data you need. Many companies provide professional web data scraping services that deliver the data directly, or you can hire a web scraping specialist to get the data for you.
2. Some websites that provide public data-sets
A number of public data-sets are freely available online, and you can easily download or buy them. Below are some commonly used websites for retrieving public data.
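Whichever site you download from, public data-sets most often arrive as CSV files. Here is a minimal sketch of loading one with Python's standard `csv` module; the inline sample data is hypothetical and stands in for a downloaded file.

```python
import csv
import io

# A tiny inline sample standing in for a downloaded data-set;
# for a real file you would use open("dataset.csv", newline="").
raw = """city,hotels,avg_price
Paris,310,142
Tokyo,287,118
"""

# DictReader keys each row by the header line, so columns
# can be accessed by name instead of position.
rows = list(csv.DictReader(io.StringIO(raw)))
avg = sum(int(r["avg_price"]) for r in rows) / len(rows)
print(rows[0]["city"], avg)  # Paris 130.0
```

From here the rows can feed directly into whatever analysis or mining step you have planned.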
3. Share resources with your friends
It’s wonderful to make friends with people who are good at web crawling and to share experiences with them. You can easily find them on web crawling forums and blogs. Many people love to crawl data from the web but don’t know how to make better use of the data they’ve collected. You can learn from each other by sharing web crawling skills and expertise.