How to Scrape Data, Save Information from ANY Website for Offline Viewing?
Monday, June 28, 2021
Data is fuelling every business. By 2024, the world is expected to consume 149 zettabytes of data. To put that number in perspective, a single zettabyte is 1024^7 bytes, or roughly 10^21 bytes. This surge in data is largely attributed to rapid digitalization across the globe. Data analytics is not new, but the approach is. Humans have always analyzed data in one way or another, but they are not as efficient as machines at processing big data. Machines haven’t yet surpassed human intelligence, but they have outshone us in efficiency. Data science and machine learning leverage big data to make more accurate, validated business decisions.
Tomorrow’s business leaders are
- harvesting data today,
- analyzing data,
- milking value from data,
- devising strategies & executing them to lead the future.
But where is this data? You can find it on your website, as well as other websites and apps, business portals, social media platforms, IoT sensors, etc.
How do you get access to this data? Most publicly available data can be scraped from websites, either manually (not recommended) or in an automated fashion (recommended; details follow in the next sections). Depending on your use case, you may also purchase data from third parties, but this can be costly, and you have no control over the quality of the data.
- If you're in the FMCG business and need product data, you can scrape multi-vendor e-commerce websites or your competitor’s online websites and e-commerce stores to grab highly relevant data.
- If you’re in the travel & hospitality sector and need restaurant, hotel & location data, you may scrape Google Maps, TripAdvisor, Booking.com, and several others based on your requirements.
- For research and other requirements, you may scrape news portals, government websites, and scientific research paper aggregator websites.
- If you need Jobs and Vacancy related data, you may scrape indeed.com, naukri.com, linkedin.com, or other relevant websites.
Before we proceed further, it’s good to understand the difference between web scraping and screen scraping:
- Web scraping primarily extracts data from the web, i.e., websites and applications hosted online. These websites are generally accessible to the public. Examples: e-commerce websites, travel portals, news websites, etc.
- Screen scraping is a more generic form of web scraping. What does that mean? It means anything accessible via digital screens can be scraped using screen scraping tools. Examples: banking websites, ERP database applications, etc.
This article solely focuses on web scraping tools & techniques. Now, having explained where the data is and how you get access to sample data, let’s explore why automated scraping should be preferred over manual scraping.
Why Should You Choose Automated Scraping Over Manual?
You can collect data from websites in two ways:
- Assign humans to the task of scraping data, i.e., manual scraping.
- Employ bots (computer programs) to collect data and save it as JSON, spreadsheets, or raw documents.
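To make the second option concrete, here is a minimal sketch of the "save it" half of a bot's job, using only Python's standard library. The hotel records are hypothetical placeholders, not real scraped data, and the function names `to_json` and `to_csv` are made up for this illustration.

```python
import csv
import io
import json

# Hypothetical records, standing in for whatever a scraping bot collected.
records = [
    {"name": "Example Hotel A", "stars": 4, "price": "120.00"},
    {"name": "Example Hotel B", "stars": 3, "price": "85.50"},
]

def to_json(rows):
    """Serialize the rows as a JSON array."""
    return json.dumps(rows, indent=2)

def to_csv(rows):
    """Serialize the rows as CSV text that any spreadsheet program can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

json_text = to_json(records)
csv_text = to_csv(records)
```

Writing `json_text` or `csv_text` to a file gives you a JSON document or a spreadsheet-friendly file respectively.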
Manual website scraping is the easiest way to start extracting data, but we don’t recommend it for most scraping tasks. It should only be preferred if your data requirements are very small, say, if you need data about 10 products and only once. For anything above that, automated bot scraping is far more efficient and will save you time, money, and resources.
How do humans scrape data?
It’s as simple as pointing your cursor to the target data, selecting it, and copy/pasting it to your target database.
What’s the drawback of manual data scraping?
- It’s damn slow. Yes, slower than the three-toed sloths.
- It’s costly, as humans do charge money.
- It’s prone to human-triggered errors.
- It’s not scalable. Technically it is, but that would mean spending millions of dollars on something that can be achieved for a few hundred or a few thousand.
How is automated website scraping performed?
There are two ways to perform automated website scraping:
- Using Web Scraping Tools
- Using Custom Scripts For Automating Data Scraping
Website Scraping Using Web Scraping Tools
There are tools, which I would call smart browsers, that can be taught to imitate repetitive human actions. Once you train them to perform certain actions, they can repeat the task any number of times. Octoparse is one such smart web scraping tool. The best of these web scraping tools are intuitive: you use them as you would use a normal web browser. The only difference is that here you teach the browser to extract the data you are interested in. We show a demo towards the end of this insight. You don’t need to know any coding to use web scraping tools like Octoparse, but knowing XPath and regular expressions (regex) is helpful.
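As a rough illustration of what XPath and regular expressions each do, here is a small Python sketch. The markup, class names, and prices are hypothetical. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup; real pages usually need a more forgiving HTML parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a listing page.
SNIPPET = """
<div>
  <div class="hotel"><h3>Example Hotel A</h3><span class="price">US$120 per night</span></div>
  <div class="hotel"><h3>Example Hotel B</h3><span class="price">US$85 per night</span></div>
</div>
"""

root = ET.fromstring(SNIPPET)

# XPath: every <h3> directly inside a <div> whose class attribute is "hotel".
names = [h3.text for h3 in root.findall(".//div[@class='hotel']/h3")]

# Regex: pull the first run of digits out of the free-text price.
price_pattern = re.compile(r"(\d+)")
prices = [
    int(price_pattern.search(span.text).group(1))
    for span in root.findall(".//span[@class='price']")
]
```

XPath answers "where on the page is the element?", while the regex answers "which part of that element's text do I want?".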
Follow these resources to learn more about XPath & regex:
What are the benefits of using web scraping tools for extracting and saving data from websites?
- Easy to start: click & extract. These tools have little to no learning curve. If you know how to click mouse buttons, you can start using web scraping tools.
- Highly scalable, you can scrape millions of data points at blazing fast speed.
- Cost-efficient, as bots are put to work. The costs incurred using web scraping tools are far lower than those of manual scraping.
- Automatic handling of anti-scraping website architectures. Many scraping tools have mechanisms to deal with anti-bot measures like captchas, website fingerprinting, and cookie-based bot bans.
- Allows you to extract data in your desired format: JSON, .xls, etc., or to your desired databases like MongoDB, MySQL, etc.
- Enables you to schedule and periodically scrape data from websites.
- Also, you can scrape data in the cloud and scale your resources up or release your resources when there is no need.
Why not use “click and extract” web scraping tools?
- If your data requirements are very small, i.e., if you need to scrape only one or two pages.
- If your source website is highly unstructured, i.e., its pages follow varying patterns.
Website Scraping Using Custom Scripts
This is quite similar to using web scraping tools. But unlike with web scraping tools, you don’t get to click and extract the data. Instead, you write a bot in a language of your choice (Python, Node.js, PHP, Java, etc.) and imitate human interactions with the website. You then run the scripts locally on your system or in the cloud to scrape the data.
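A minimal sketch of what such a script can look like in Python, using only the standard library's `html.parser`. The page source, the `hotel-name` class, and the `HotelNameParser` class are all hypothetical; a real script would first download the page over HTTP and would need selectors matching the actual site.

```python
from html.parser import HTMLParser

# Hypothetical page source; a real script would first download this over HTTP.
PAGE = """
<html><body>
  <h3 class="hotel-name">Example Hotel A</h3>
  <p>Some description</p>
  <h3 class="hotel-name">Example Hotel B</h3>
</body></html>
"""

class HotelNameParser(HTMLParser):
    """Collects the text of every <h3 class="hotel-name"> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._inside_name = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (attribute, value) pairs.
        if tag == "h3" and ("class", "hotel-name") in attrs:
            self._inside_name = True

    def handle_endtag(self, tag):
        if tag == "h3":
            self._inside_name = False

    def handle_data(self, data):
        # Keep only non-empty text found inside a matching <h3>.
        if self._inside_name and data.strip():
            self.names.append(data.strip())

parser = HotelNameParser()
parser.feed(PAGE)
hotel_names = parser.names
```

The parser plays the role of the "click and extract" step in a scraping tool: it decides which elements on the page count as data.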
What are the benefits of scraping websites using custom scripts?
- Ridiculously Scalable
- Highly Customizable
- Cost-efficient for large scale scraping
- Can be scheduled to perform periodic scraping
Why not scrape the web using custom scripts?
- When the data source is highly structured, web scraping tools should be preferred, as they get you started faster
- Huge learning curve
- Automation engineers command a high salary, which you need to pay
- You have to handle anti-scraping techniques on your own. This is sometimes a huge overhead.
- You have to write scripts for storing data in the database.
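To give a sense of the storage work mentioned in the last point, here is a small sketch using Python's built-in `sqlite3` module. SQLite is swapped in here only because it needs no server; the same idea applies to MySQL or MongoDB with their own client libraries. The table layout, the rows, and the `store` function are hypothetical.

```python
import sqlite3

# Hypothetical scraped rows: (name, stars, price).
rows = [
    ("Example Hotel A", 4, 120.0),
    ("Example Hotel B", 3, 85.5),
]

def store(conn, hotels):
    """Create the table if needed and insert the scraped rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hotels "
        "(name TEXT PRIMARY KEY, stars INTEGER, price REAL)"
    )
    # INSERT OR REPLACE keeps a re-run from duplicating hotels already stored.
    conn.executemany("INSERT OR REPLACE INTO hotels VALUES (?, ?, ?)", hotels)
    conn.commit()

conn = sqlite3.connect(":memory:")  # use a file path instead to persist the data
store(conn, rows)
store(conn, rows)  # a second run overwrites rather than duplicates
stored = conn.execute(
    "SELECT name, stars, price FROM hotels ORDER BY name"
).fetchall()
```

Keying the table on the hotel name is a simplification; a periodic scrape would more likely key on a stable identifier and record the scrape date as well.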
How to scrape data from any website?
Now, we shall demonstrate scraping Booking.com using Octoparse. This can be useful for building hotel aggregator websites or devising the right pricing strategy for your hotels.
Scraping with Octoparse is only a three-step process.
Step 1: Enter your target URL.
In our case, this is our target URL.
Step 2: Choose the data points that need to be scraped.
For the demo, we shall scrape the hotel name, star rating, address, and price.
Step 3: Run the extraction template and scrape the data.
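For readers who think in code, the same three steps can be sketched in Python. This is only an illustration of the idea, not how Octoparse works internally. The `FAKE_SITE` dictionary is a hypothetical in-memory stand-in for a paginated website, and the `data-name` and `data-price` attributes are invented for the example.

```python
import re

# Hypothetical in-memory "website": each URL maps to (HTML, next page URL).
# A real run would download pages over HTTP instead.
FAKE_SITE = {
    "/hotels?page=1": (
        '<li data-name="Example Hotel A" data-price="120"></li>',
        "/hotels?page=2",
    ),
    "/hotels?page=2": (
        '<li data-name="Example Hotel B" data-price="85"></li>',
        None,
    ),
}

# Step 1: the target URL.
START_URL = "/hotels?page=1"

# Step 2: the data points to extract, expressed here as one regex per field.
FIELDS = {
    "name": re.compile(r'data-name="([^"]*)"'),
    "price": re.compile(r'data-price="([^"]*)"'),
}

def scrape(start_url):
    """Step 3: follow the pagination links and extract the chosen fields."""
    results = []
    url = start_url
    while url is not None:
        html, next_url = FAKE_SITE[url]
        names = FIELDS["name"].findall(html)
        prices = FIELDS["price"].findall(html)
        for name, price in zip(names, prices):
            results.append({"name": name, "price": int(price)})
        url = next_url  # move to the next page, or stop when there is none
    return results

hotels = scrape(START_URL)
```

The `while` loop corresponds to a pagination loop in a scraping tool: it keeps visiting the next page until no further page exists.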
Let’s explore in detail:
Post login, click on “Advanced Mode Task”.
In the next screen, enter the URL:
And click on “Save Url”.
Turn on the workflow mode on the next screen.
You’ll see the following screen:
Click on any of the following pagination links:
and choose “Loop click single element” from the “Action Tips” component:
Now, click on the pagination box and update the XPath to:
Now, you need to click on “Go To Web Page” to move back to the first page.
Then click on the pagination box so that the loop extract operations can be performed correctly.
Now, click on all the data points which you seek to extract.
On the “Action Tips” component click “Extract data”. Then click on Field Names and update with your desired names.
Click on “Save” and then on “Start extraction”.
You would see the following screen. Click on “Local Extraction”.
We can also extract data in the cloud, but for demonstration, we would stick to local extraction.
You would see the following screen on the successful execution of this demo:
Once the scraping is complete, or if you manually stop it, you can export the data in the following formats:
We saved sample data in a Google spreadsheet.
Here is a snapshot:
In this insight, we saw
- how to scrape data from the web, and
- how to save scraped data in your desired format or to your preferred database.
We also demonstrated
- how to scrape Booking.com using Octoparse, and
- how to save the data in .xls format to view it in Google Sheets.
Octoparse is your go-to tool for all your scraping needs. You can create workflows that feed your ETL pipeline with highly structured data. Using Octoparse you can -
- Use pre-built templates to scrape popular websites like Amazon, Indeed, etc.
- Build APIs and use them in your application.
- Prepare custom workflows to scrape complex websites
- Store data in XLS, JSON, HTML, CSV, or your database
- Scrape in the cloud
For more resources on scraping, refer to this.