Task / Workflow Debugging
Monday, January 14, 2019
When a task we create with Octoparse doesn't work as expected, how can we find the bug in the task/workflow?
This article shows you how to debug your scraping task in Octoparse. By following these steps, we can debug the task/workflow on our own:
Step 1: Manually click through each step in the workflow
Generally speaking, when we click on a step in the workflow, the corresponding process is displayed in the built-in browser and details about this step are displayed in "Customize Action".
Since Octoparse executes the steps from the top down, we should click through them in the same top-down order.
The following example shows how to debug by manually clicking each step.
1. Click "Go To Web Page". The target web page opens in the built-in browser, and the Go-To-Web-Page action can be customized in "Customize Action".
If the web page takes a long time to load, you may need to extend the Timeout.
2. Click the "Pagination" loop to check whether the loop item is located accurately on the next-page button.
The loop item information, such as the next-page button, should be shown in "Customize Action".
If the loop item is not located accurately on the next-page button or page number, we need to modify the XPath of the "Pagination" loop. You can refer to these two tutorials: How to handle pagination with page numbers? and Extract multiple pages through pagination.
3. Click "Click to Paginate" to check whether the pagination works.
If the action works, the next page is displayed in the built-in browser. If not, you may need to modify the XPath of the "Pagination" loop.
We also need to check whether the website loads the next page with AJAX. If so, an "AJAX Timeout" must be set for this step.
4. Click "Loop Item" to check if all the items on the current page are located accurately. The Loop item information will be shown in the "Customize Action".
5. Click "Click Item" to check whether the corresponding page is displayed in the built-in browser.
As with the "Click to Paginate" step, we need to check whether the website refreshes its content with AJAX. If so, an "AJAX Timeout" must be set for this step.
6. Click "Extract data" to check if the targeted data is extracted accurately.
If data is extracted into the wrong columns or not extracted at all, the cause may be an inaccurate XPath, which can be solved by referring to the following tutorials:
· Before clicking the next step, we must make sure the page is fully loaded, i.e., the loading signal has disappeared.
· When we click "Click Item" or "Extract data" inside a loop, we should also select an item in the loop other than the first one. This shows whether the "Click Item" or "Extract data" step works for every item and not only for the first.
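The XPath checks mentioned in the steps above can also be rehearsed outside Octoparse. The following is a minimal sketch, not part of Octoparse itself: it assumes a hypothetical, well-formed page snippet (`SAMPLE_PAGE`) and uses Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath. The class names `pager`, `next` and `item`, and the helper `match_xpath`, are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a listing page.
SAMPLE_PAGE = """
<html>
  <body>
    <ul id="results">
      <li class="item"><a href="/p/1">First product</a></li>
      <li class="item"><a href="/p/2">Second product</a></li>
      <li class="item"><a href="/p/3">Third product</a></li>
    </ul>
    <div class="pager">
      <a class="prev" href="/list?page=1">Previous</a>
      <a class="next" href="/list?page=3">Next</a>
    </div>
  </body>
</html>
"""


def match_xpath(page_source, xpath):
    """Return the stripped text of every element matched by the XPath."""
    root = ET.fromstring(page_source.strip())
    return [(element.text or "").strip() for element in root.findall(xpath)]


# A pagination XPath should match exactly one element: the next-page button.
next_button = match_xpath(SAMPLE_PAGE, ".//div[@class='pager']/a[@class='next']")

# A loop-item XPath should match every item on the page, not only the first.
loop_items = match_xpath(SAMPLE_PAGE, ".//li[@class='item']/a")

print("Next-page matches:", next_button)
print("Loop item matches:", loop_items)
```

If the pagination expression matches nothing, or more than one element, or the loop-item expression matches fewer elements than the page shows, the XPath probably needs the same kind of adjustment described in the tutorials above.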
Step 2: Run the task by Local Extraction
After making sure each step works well by manually clicking, we can run the task by Local Extraction to check if there are any bugs.
There is likely a bug when any of the following situations occurs:
· Getting no data extracted
When the reminder pops up, refer to Why Octoparse stops and no data is extracted?.
· Extracting duplicate data
When the task keeps producing duplicate data, the problem most likely lies in its "Loop Item". The solution is described in the article: Why does Octoparse only extract the first item and duplicate?
· Too much data missing from the extraction, which may be caused by:
1) The "Loop Item" does not cover all the items on each listing page.
2) The web page does not load completely.
· Extracting data at a relatively slow speed
If the local extraction runs slowly, the cause is likely the local environment, such as the operating system, hardware capacity, IP address or network bandwidth. The content of the website also affects the scraping speed. For example, a website that contains a lot of images takes more time to load fully.
However, slow speed can also be a sign of a bug. For example, if we forget to set an AJAX Timeout for a step that needs one, Octoparse waits 120 seconds by default before proceeding.
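The checks for duplicate and missing data listed above can be automated once the results of a local run have been exported. The sketch below is only an illustration under stated assumptions: the export is taken to be a CSV file with a header row, the sample content in `EXPORT` and its column names are hypothetical, and `audit_rows` is a helper written for this example, not an Octoparse feature.

```python
import csv
import io
from collections import Counter

# Hypothetical export; in practice this would be read from the saved CSV file.
EXPORT = """title,price,url
Widget A,9.99,/p/1
Widget A,9.99,/p/1
Widget B,,/p/2
Widget C,4.50,/p/3
"""


def audit_rows(csv_text):
    """Count total rows, repeated rows, and empty cells per column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    # Rows that are identical in every column count as duplicates.
    row_counts = Counter(tuple(row.items()) for row in rows)
    duplicates = sum(count - 1 for count in row_counts.values())

    # A cell is treated as missing when it is empty or whitespace only.
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if not (value or "").strip():
                missing[column] += 1

    return {"total": len(rows), "duplicates": duplicates, "missing": dict(missing)}


report = audit_rows(EXPORT)
print(report)
```

A total of zero corresponds to the "no data extracted" case, a non-zero duplicate count points at the "Loop Item", and a column with many missing cells suggests that the loop does not cover all items or that the page had not finished loading.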
Step 3: Debug in Cloud Extraction (Premium User)
Before debugging in cloud extraction, we must make sure the task already works well when we click through each step in the workflow manually and when we run it with local extraction. We already have tutorials that deal with situations that occur in cloud extraction, including:
· Data missing when using cloud extraction
If we notice that some data is missing from the results of cloud extraction, we can refer to this tutorial: How to deal with data missing on cloud extraction?.
· Getting no data extracted on the cloud
Sometimes a task runs well locally but extracts no data when it runs in the cloud.
In that case, we can refer to the tutorial: Why does cloud extraction get no data while local extraction works perfectly?
Author: Erika F