How to Maintain Data Quality While Web Scraping
Monday, June 08, 2020
When scraping the web, it is critical to get high-quality data, especially when scraping at a large scale or extracting website data where accuracy is extremely important. In this article, I will discuss how to improve your web crawler to maintain data quality and get accurate data.
Web scraping challenges that may affect data quality
Website structure changes
Websites are constantly updating their UIs and layouts to appeal to more visitors. As a web crawler is usually built according to the webpage structure at the current time, it needs to get updated every now and then. If a website changes its structure drastically, a crawler may not be able to pull data from it anymore.
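One generic way to notice a layout change early is to check every scraped batch for missing fields. The following is a minimal sketch rather than an Octoparse feature; it assumes the scraped records are Python dictionaries, and the field names `title` and `price` are hypothetical.

```python
def missing_field_rate(records, required_fields):
    """Return the fraction of records that lack a value for any required field."""
    if not records:
        # An empty batch is treated as fully broken: nothing was extracted.
        return 1.0
    incomplete = 0
    for record in records:
        for field in required_fields:
            value = record.get(field)
            # A field counts as missing if it is absent, None, or blank text.
            if value is None or not str(value).strip():
                incomplete += 1
                break
    return incomplete / len(records)


# Hypothetical batch: the second record lost its price, e.g. after a redesign.
batch = [
    {"title": "Item A", "price": "9.99"},
    {"title": "Item B", "price": ""},
]
rate = missing_field_rate(batch, ["title", "price"])
print(f"{rate:.0%} of records are incomplete")
```

A sudden jump in this rate between two runs is a hint that the page structure may have changed and the crawler needs updating.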
Some websites require you to log in before you can scrape any content. When it runs into such a website, your crawler may get stuck and fail to pull any data at all.
Wrong data fetched
When selecting elements on a complex webpage, it may be hard to locate the target data, as the auto-generated XPath in web crawlers may not be accurate. In this case, you may fetch the wrong data from the page.
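To illustrate how an XPath can pick up the wrong element, here is a small sketch using Python's standard `xml.etree.ElementTree` module, which supports a limited subset of XPath. The page markup, class names, and values are all made up for the example; the point is only that a path based on element position may break when the page gains a new element, while a path anchored on attributes is less affected.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified product page (well-formed so the stdlib parser accepts it).
PAGE = """
<html><body>
  <div class="banner"><span>Free shipping today</span></div>
  <div class="product">
    <span class="name">Desk lamp</span>
    <span class="price">24.50</span>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE)

# Positional path: "first span in the first div". If it was written before the
# banner existed, it now points at the banner text instead of the product.
positional = root.find("./body/div[1]/span[1]").text

# Attribute-based path: anchored on class names, so the banner does not affect it.
by_class = root.find(".//div[@class='product']/span[@class='price']").text

print(positional)
print(by_class)
```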
Only a limited amount of data can be pulled
Another consequence of failing to locate the correct element is that the bot cannot click a target button to open a new page, such as the pagination button(s). In such situations, the bot may scrape the first page repeatedly without moving on to the following pages.
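A crawler that keeps scraping the same page can be caught by comparing each page's content with the pages already seen. The sketch below is a generic illustration, not something Octoparse exposes; `fetch_page` is a hypothetical callable that returns the rows of a given page, and `stuck_fetch` simulates a crawler whose "next page" click never works.

```python
import hashlib


def crawl_pages(fetch_page, max_pages):
    """Collect rows page by page, stopping if a page repeats an earlier one."""
    seen = set()
    rows = []
    for page_number in range(1, max_pages + 1):
        page_rows = fetch_page(page_number)
        # Fingerprint the page content so repeated pages are cheap to detect.
        fingerprint = hashlib.sha256("\n".join(page_rows).encode("utf-8")).hexdigest()
        if fingerprint in seen:
            # Same content as an earlier page: pagination is probably not advancing.
            print(f"Page {page_number} repeats earlier content; stopping.")
            break
        seen.add(fingerprint)
        rows.extend(page_rows)
    return rows


def stuck_fetch(page_number):
    """Simulated broken crawler: every request returns the first page."""
    return ["row 1", "row 2"]


print(crawl_pages(stuck_fetch, max_pages=5))
```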
Data extracted is incomplete
A common scenario when scraping websites is that some sites, like Twitter, only load more content when you scroll down the page. If the page is not scrolled down, the additional content is never displayed, and the bot won’t be able to get complete data sets.
Of course, there are many other factors that affect data quality, and the above-mentioned scenarios are just some common ones. So what are some methods we can adopt to optimize a crawler in Octoparse and maintain data quality when scraping the web?
Method #1: Switch auto-detected data results
Octoparse is an automated web scraper. When you build a web crawler in Octoparse, it scans the web page, auto-detects the data on it, and fetches one or more data sets using its machine learning algorithm. If the auto-detected data is not your target data, you can switch to the other data sets by clicking "Switch auto-detect results". This saves you the trouble of writing XPath for the target data and helps keep data quality high.
Method #2: Build your crawler from scratch
However, if none of the auto-detected results meet your expectations, you will need to set up the workflow manually. But don’t worry: Octoparse simulates human interaction with webpages, so you can build a scraper by pointing and clicking. If you’ve never built a crawler from scratch with Octoparse before, I highly recommend you give it a spin by following this video and this tutorial.
As different websites vary in layouts and structures, not all data can be accessed instantly.
Case 1: Extract data behind login
Some websites require login credentials, so you will need to provide your user account to allow Octoparse to open the page and scrape the data for you.
Case 2: Enter a (list of) search term(s)
Or, you may need to enter a single term or a list of search phrases first, so that Octoparse can fetch corresponding search results.
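Octoparse types the terms into the site's search box for you. Outside the tool, a different but related approach works when a site puts the search term in the URL: you can prepare one URL per term up front. The sketch below assumes exactly that; the base URL and the `q` parameter name are hypothetical and will differ from site to site.

```python
from urllib.parse import urlencode

# Hypothetical search endpoint; replace with the real one for your target site.
BASE_URL = "https://example.com/search"


def build_search_urls(terms, base_url=BASE_URL, param="q"):
    """Turn a list of search terms into one search URL per non-empty term."""
    urls = []
    for term in terms:
        cleaned = term.strip()
        if not cleaned:
            # Skip blank lines so the crawler does not run an empty search.
            continue
        # urlencode escapes spaces and special characters in the term.
        urls.append(f"{base_url}?{urlencode({param: cleaned})}")
    return urls


for url in build_search_urls(["desk lamp", " ", "chair"]):
    print(url)
```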
Case 3: Click through a drop-down menu
Another common scenario is that you need to choose a category from a drop-down menu, which then takes you to a new page.
Method #3: Locate the pagination button(s) to extract more pages
To extract data from multiple pages, you will need to set up pagination by clicking either the single “next page” button or the page numbers.
Usually, Octoparse will detect the “next page” button automatically. If the “next page” button isn’t detected, you can check out this video to see how you can troubleshoot this problem.
On the other hand, if a website has a pagination bar that displays page numbers, you will need to click through the page numbers to visit the respective pages. To do this, you will need to edit the XPath to locate the page numbers. XPath stands for XML Path Language and can be used to locate data precisely. You can follow this tutorial to learn how to handle pagination with page numbers.
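As a rough illustration of locating page numbers, the sketch below finds the link that follows the current page in a pagination bar. The markup and the `current` class are hypothetical. It uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath, so the "next sibling" step is done in Python; in a full XPath 1.0 engine the same idea could be written as a single expression with the `following-sibling` axis.

```python
import xml.etree.ElementTree as ET

# Hypothetical pagination bar; real sites use their own markup and class names.
PAGINATION = """
<ul class="pagination">
  <li><a class="current" href="/list?page=1">1</a></li>
  <li><a href="/list?page=2">2</a></li>
  <li><a href="/list?page=3">3</a></li>
</ul>
"""


def next_page_href(pagination_xml):
    """Return the href of the page number after the current one, or None."""
    bar = ET.fromstring(pagination_xml)
    links = bar.findall("./li/a")  # every page-number link, in document order
    for index, link in enumerate(links):
        if link.get("class") == "current":
            if index + 1 < len(links):
                return links[index + 1].get("href")
            return None  # the current page is the last one
    return None  # no link is marked as current


# Full XPath 1.0 equivalent (not supported by ElementTree):
#   //li[a[@class='current']]/following-sibling::li[1]/a
print(next_page_href(PAGINATION))
```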
Method #4: Scroll down the page multiple times to load more data
For long pages, data can only be scraped successfully after the page is fully loaded. To get more data displayed, we need to scroll down the page. Usually, Octoparse automatically scrolls down the page when a web page is loaded in the built-in browser. You can easily edit the number of scrolls to control how much content is displayed, which helps you get complete data from the web page(s).
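The idea of choosing a number of scrolls can be sketched in code: keep scrolling until no new items appear or a maximum is reached. This is a generic illustration, not how Octoparse is implemented. `scroll_once` and `count_items` are hypothetical callables that a browser automation layer would provide, and `FakeFeed` is a stand-in for a real page so the example can run on its own.

```python
def scroll_until_stable(scroll_once, count_items, max_scrolls=20):
    """Scroll until the item count stops growing or max_scrolls is reached.

    Returns a tuple (scrolls_performed, final_item_count).
    """
    previous = count_items()
    scrolls = 0
    while scrolls < max_scrolls:
        scroll_once()
        scrolls += 1
        current = count_items()
        if current == previous:
            # The last scroll revealed nothing new: assume the page is fully loaded.
            break
        previous = current
    return scrolls, previous


class FakeFeed:
    """Stand-in for a browser page: each scroll reveals up to `batch` more items."""

    def __init__(self, total=35, batch=10):
        self.total = total
        self.batch = batch
        self.loaded = min(batch, total)

    def scroll(self):
        self.loaded = min(self.loaded + self.batch, self.total)

    def count(self):
        return self.loaded


feed = FakeFeed()
print(scroll_until_stable(feed.scroll, feed.count))
```

Setting `max_scrolls` too low truncates the data, which mirrors what happens when the scroll count in a scraping task is too small for the page.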
As Octoparse is an intelligent web scraping tool, there are even more options for you to optimize your crawler and ensure high data quality.
Method #5: Edit the workflow
Understanding the workflow is the key to successfully building a crawler. A workflow shows the whole extraction process, with each step representing one action in that process.
Whether Octoparse generated the workflow automatically or you built it manually, you can edit the workflow to ensure the task does what you tell it to do.
Each step has various settings that you can modify to fine-tune your scraping task.
1. Rearrange the steps of the workflow by simple drag-and-drop
In most cases, for the workflow to work properly, the loop item should be inside the pagination loop, as the bot needs to finish scraping all the items on the current page before proceeding to the following pages. If the auto-generated workflow places the loop item outside the pagination loop, you can always drag it back inside to repair the workflow.
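The nesting described here maps onto two nested loops. The sketch below is only a simplified model of the idea, with pages represented as plain Python lists of hypothetical item names: the outer loop plays the role of the pagination loop and the inner loop plays the role of the loop item.

```python
def run_workflow(pages):
    """Extract every item from every page, one page at a time.

    `pages` is a list of pages; each page is a list of item names.
    """
    extracted = []
    # Outer loop: the pagination loop, visiting pages in order.
    for page_number, items in enumerate(pages, start=1):
        # Inner loop: the loop item, finishing the current page before moving on.
        for item in items:
            extracted.append((page_number, item))
    return extracted


print(run_workflow([["lamp", "chair"], ["desk"]]))
```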
2. Add steps to the workflow
If you want to add more actions to the workflow, simply hover the mouse between two steps until a little plus sign shows up. Octoparse provides multiple options, which let you build more advanced crawlers to deal with complex websites.
3. Rename, copy, or delete steps
When building your workflow, you may make mistakes and want to undo an action. This is easy in Octoparse, as you can rename, open, copy, paste, and delete any step by clicking the ellipsis icon.
4. Hover over and check general settings of each step
You can easily check the settings of each step by hovering your cursor over the workflow. For instance, you can check the URL of the current webpage, the wait time before each step, the data to be extracted, and so on.
5. Modify detailed settings of each step
Last but not least, after checking the general settings of each step, if you want to make further changes to the details, simply click the settings icon and you will find various options. Now you can roll up your sleeves and edit the detailed settings of each step the way you want them.