Web Scraping: How It All Started And Will Be
Monday, October 22, 2018
What is web scraping?
Web scraping, also known as web harvesting or web data extraction, refers to obtaining data available on the World Wide Web, typically via the Hypertext Transfer Protocol (HTTP) or through a web browser.
How does web scraping work?
Generally, scraping a web page involves only two steps:
Fetch the web page → copy the specific data out of the page into a spreadsheet or database.
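The two steps can be sketched in a few lines of Python using only the standard library. The URL and the extracted field (the page title) are illustrative choices; a canned response stands in for the HTTP fetch so the sketch runs without network access.

```python
from html.parser import HTMLParser
from urllib.request import urlopen  # step 1 would use this in practice

# Step 1: fetch the page. In practice you would call
# urlopen("https://example.com").read().decode("utf-8");
# here a canned response stands in for the HTTP fetch.
html = "<html><head><title>Example Domain</title></head><body>...</body></html>"

class TitleParser(HTMLParser):
    """Step 2: copy one specific piece of data (the <title>) out of the page."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            self.title += data

# Parse out the data you want, ready to write to a spreadsheet or database.
parser = TitleParser()
parser.feed(html)
print(parser.title)  # Example Domain
```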
How did it all start?
Though to many people it sounds as fresh as concepts like “Big Data” or “machine learning”, the history of web scraping is actually much longer: it dates back to the time when the World Wide Web, or colloquially “the Internet”, was born.
At the very beginning, the Internet was even unsearchable. Before search engines were developed, the Internet was just a collection of File Transfer Protocol (FTP) sites in which users would navigate to find specific shared files. To find and organize distributed data available on the Internet, people created a specific automated program, known as the web crawler/bot today, to fetch all pages on the Internet and then copy all content into databases for indexing.
Then the Internet grew, eventually becoming home to millions of web pages containing a wealth of data in many forms, including text, images, video and audio. It turned into an open data source.
As this data source became incredibly rich and easily searchable, people found it simple to seek out the information they wanted, which was often spread across a large number of websites. But a problem arose when they wanted to get that data off the Internet: not every website offered download options, and copying by hand was tedious and inefficient.
And that’s where web scraping came in. Web scraping is powered by web bots/crawlers that function the same way as those used in search engines: fetch and copy. The main difference is scale. Web scraping focuses on extracting specific data from certain websites, whereas search engines fetch most of the websites on the Internet.
· 1989 The birth of World Wide Web
Technically, the World Wide Web is different from the Internet: the former refers to the information space, while the latter is the network of computers it runs on.
Tim Berners-Lee, the inventor of the WWW, brought us three things that have long been part of our daily life:
- Uniform Resource Locators (URLs) which we use to go to the website we want;
- embedded hyperlinks that let us navigate between web pages, such as product detail pages where we can find specifications and sections like “customers who bought this also bought”;
- web pages that contain not only text, but also images, audio, video and software components.
· 1990 The first web browser
Also invented by Tim Berners-Lee, it was called WorldWideWeb (no spaces), named after the WWW project. One year after the appearance of the web, people had a way to see it and interact with it.
· 1991 The first web server and the first http:// web page
The web kept growing at a rather mild pace. By 1994, the number of HTTP servers was over 200.
· 1993-June First web robot - World Wide Web Wanderer
Though it functioned the same way web robots do today, it was intended only to measure the size of the web.
· 1993-December First crawler-based web search engine - JumpStation
As there were not many websites on the web yet, search engines at the time relied on human website administrators to collect links and edit them into a particular format.
JumpStation brought a new leap: it was the first WWW search engine to rely on a web robot.
Since then, people have used these programmatic web crawlers to harvest and organize the Internet. From Infoseek, AltaVista and Excite to Bing and Google today, the core of a search engine bot remains the same:
find a web page, download (fetch) it, scrape all the information it presents, and then add it to the search engine’s database.
Because web pages are designed for human readers rather than for automated use, web scraping remained hard for computer engineers and scientists even as web bots improved, let alone for ordinary people. So people have worked to make web scraping more accessible.
· 2000 Web API and API crawler
API stands for Application Programming Interface. It is an interface that makes a program much easier to develop by providing ready-made building blocks.
In 2000, Salesforce and eBay launched their own APIs, which enabled programmers to access and download some of the data they made available to the public.
Since then, many websites have offered web APIs for accessing their public data: send an HTTP request, receive JSON or XML in return.
Web APIs offer developers a friendlier way to do web scraping, by simply gathering the data a website chooses to provide.
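The request-and-parse flow looks roughly like this in Python. The endpoint and the response fields are made up for illustration; real APIs, such as those Salesforce and eBay introduced, each define their own. A canned JSON string stands in for the server's reply so the sketch runs offline.

```python
import json
from urllib.request import Request

# Build the HTTP request an API client would send.
# The URL and query parameter are hypothetical.
req = Request(
    "https://api.example.com/v1/products?id=42",
    headers={"Accept": "application/json"},
)

# A canned response stands in for the JSON the server would return.
body = '{"id": 42, "name": "Widget", "price": 9.99}'

# Parse the structured response instead of scraping HTML.
product = json.loads(body)
print(product["name"], product["price"])  # Widget 9.99
```

Because the response is already structured data, no HTML parsing is needed, which is exactly what makes APIs the friendlier route when a site offers one.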
· 2004 Python’s Beautiful Soup
Not all websites offer APIs, and even those that do don’t provide all the data you might want. So programmers kept working on approaches that could make web scraping easier.
In 2004, Beautiful Soup was released. It is a library designed for Python.
In computer programming, a library is a collection of reusable modules, such as commonly used algorithms, that can be used without rewriting them, simplifying the programming process.
With simple commands, Beautiful Soup makes sense of a site’s structure and helps parse content out of the HTML. It is considered one of the most sophisticated and advanced libraries for web scraping, and it remains one of the most common and popular approaches today.
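A small example of the kind of parsing Beautiful Soup makes simple: one call pulls every matching element out of the HTML. The HTML snippet and class names are invented for illustration, and the third-party bs4 package (`pip install beautifulsoup4`) is assumed to be installed.

```python
from bs4 import BeautifulSoup

# A made-up fragment of a product listing page.
html = """
<html><body>
  <ul id="products">
    <li class="item">Keyboard</li>
    <li class="item">Mouse</li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# One command finds every matching element in the HTML container.
items = [li.get_text() for li in soup.find_all("li", class_="item")]
print(items)  # ['Keyboard', 'Mouse']
```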
· 2005-2006 Visual web scraping software
In 2006, Stefan Andresen and his Kapow Software (acquired by Kofax in 2013) launched Web Integration Platform version 6.0, something now understood as visual web scraping software, which allows users to simply highlight the content of a web page and structure that data into a usable Excel file or database.
Finally, there was a way for the masses of non-programmers to do web scraping on their own.
Since then, web scraping has moved into the mainstream: non-programmers can now choose from more than 80 out-of-the-box data extraction tools that provide a visual workflow.
Where will web scraping go?
People always want data. We collect it, process it, and turn it into various things: research, insights, information, stories, assets and more. We used to spend so much time, effort and money simply seeking and collecting data that only big companies and organizations could afford to do it.
In 2018, what we know as the World Wide Web, or colloquially “the Internet”, is made up of over 1.8 billion websites. Such an incredibly huge amount of data is now available with just a couple of clicks. And as more people are coming online, more data is generated every second.
Obtaining data is now easier than it has ever been. Any individual, company or organization can get the data they want, as long as it is available on the web. Thanks to web crawlers/bots, APIs, standard libraries and various out-of-the-box tools, once anyone has the will to get data, there is a way. Or they can simply turn to professionals, who are both accessible and affordable.
Searching “web scraping” on guru.com returns 10,088 results, meaning more than 10,000 freelancers are offering web scraping services on that site alone. The numbers are 13,190 on Upwork and 1,024 on fiverr.com.
The rise in demand for web data by companies across industry verticals keeps driving the web scraping industry, bringing new markets, jobs, and business opportunities.
Meanwhile, like any other emerging industry, web scraping brings legal concerns as well.
The legal landscape surrounding web scraping continues to evolve, and its status remains highly context-specific. Many of the most interesting legal questions raised by the practice remain unanswered or depend on very specific facts.
Though web scraping has been practiced for a rather long time, courts are only beginning to scratch the surface of how relevant legal theories might apply in the context of big data.
The situation is still unpredictable and volatile, as the law around web crawling and scraping is still taking shape. One thing, however, is certain: as long as there is an Internet, there will be web scraping.
It was web scraping that made the newborn Internet searchable, and then made the explosively growing Internet more usable and accessible.
There is no doubt that the Internet and web scraping will keep moving forward together for the foreseeable future.