Task / Workflow Debugging

Monday, January 14, 2019

 

When a task created with Octoparse doesn’t work as expected, how can we find the bug in the task/workflow?

This article shows you how to debug your scraping task in Octoparse. By following these steps, we can debug the task/workflow on our own:

Step 1: Manually click through each step in the workflow

Step 2: Run the task by Local Extraction

Step 3: Debug in Cloud Extraction (Premium Users)

 

 

 

 

 

Step 1: Manually click through each step in the workflow

Generally speaking, when we click on a step in the workflow, the corresponding process is displayed in the built-in browser and details about this step are displayed in "Customize Action".

 

Since Octoparse executes the steps from the top down, we should click through them in the same top-down order.

The following example shows how to debug by manually clicking each step.

 

1. Click "Go To Web Page". The target webpage opens in the built-in browser, and the Go To Web Page action can be customized in "Customize Action".

If the web page takes a long time to load, you may need to extend the Timeout.

 

 

 

2. Click the "Pagination" loop to check whether the loop item accurately locates the next-page button.

The loop item information, such as the next-page button, should be shown in "Customize Action".

 

 

If the loop item fails to locate the next-page button or page number accurately, we need to modify the XPath of the "Pagination" loop. These two tutorials can help: How to handle pagination with page numbers? and Extract multiple pages through pagination.
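As a rough illustration of what checking the "Pagination" XPath means, the sketch below counts how many elements an XPath matches on a small, made-up page fragment. The markup, the class names, and both XPath expressions are assumptions for illustration only. It uses Python's standard-library ElementTree, which supports only a subset of XPath, so it is not a description of how Octoparse evaluates XPath internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical pager markup; real pages will differ.
PAGE = """
<div>
  <ul class="pager">
    <li><a class="page" href="?p=1">1</a></li>
    <li><a class="page" href="?p=2">2</a></li>
    <li><a class="next" href="?p=2">Next</a></li>
  </ul>
</div>
"""

def count_matches(markup: str, xpath: str) -> int:
    """Return how many elements the XPath matches in the fragment."""
    root = ET.fromstring(markup)
    return len(root.findall(xpath))

too_broad = ".//ul[@class='pager']//a"  # matches every link in the pager
precise = ".//a[@class='next']"         # matches only the next-page link

print(count_matches(PAGE, too_broad))
print(count_matches(PAGE, precise))
```

An XPath for the next-page button should normally match exactly one element. If it matches several, or none, the pagination loop is likely to click the wrong thing or stop early.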

 

 

3. Click "Click to Paginate" to check whether the pagination works well.

If the action works well, the next page is displayed in the built-in browser. If not, we may need to modify the XPath of the "Pagination" loop.

We also need to check whether the website loads content with AJAX. If so, an "AJAX Timeout" needs to be set up for this step.
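The reason a timeout matters on AJAX pages can be sketched generically: after a click, new content arrives without a full page load, so a scraper has to wait for it, but only up to a limit. The helper below is an illustrative polling loop, not Octoparse's implementation; `wait_for`, `content_ready`, and all timings are hypothetical.

```python
import time
from typing import Callable

def wait_for(condition: Callable[[], bool], timeout: float, interval: float = 0.05) -> bool:
    """Poll `condition` until it is true or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# Simulated AJAX response: the content becomes available on the third check.
checks = {"count": 0}

def content_ready() -> bool:
    checks["count"] += 1
    return checks["count"] >= 3

print(wait_for(content_ready, timeout=1.0))  # content arrives within the limit
print(wait_for(lambda: False, timeout=0.2))  # content never arrives, so we give up
```

A timeout that is too short gives up before the content appears, while one that is too long slows down every step that uses it.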

 

4. Click "Loop Item" to check if all the items on the current page are located accurately. The Loop item information will be shown in the "Customize Action". 

 

 

 

5. Click "Click Item" to check whether the corresponding page is shown in the built-in browser.

As with "Click to Paginate", we need to check whether the website refreshes its content with AJAX. If so, an "AJAX Timeout" needs to be set up for this step.

 

6. Click "Extract data" to check if the targeted data is extracted accurately.

 

If data is extracted into the wrong columns or not extracted at all, the cause may be an inaccurate XPath. The following tutorials explain how to fix it:

Locate elements with XPath

How to associate data with nearby text?

Data fetched to the incorrect data fields
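One way values can end up in the wrong column is when a field is located against the whole page instead of relative to each loop item. The sketch below shows the difference on made-up markup. The element names, class names, and field names are assumptions, and it uses Python's standard-library ElementTree (a limited XPath subset) rather than anything from Octoparse.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing; the second item has no price element.
PAGE = """
<ul>
  <li class="item"><span class="name">A</span><span class="price">10</span></li>
  <li class="item"><span class="name">B</span></li>
  <li class="item"><span class="name">C</span><span class="price">30</span></li>
</ul>
"""

root = ET.fromstring(PAGE)

# Page-level XPaths: names and prices are collected separately, then paired up.
names = [e.text for e in root.findall(".//span[@class='name']")]
prices = [e.text for e in root.findall(".//span[@class='price']")]
misaligned = list(zip(names, prices))

# Item-relative XPaths: each field is looked up inside its own loop item.
aligned = []
for item in root.findall(".//li[@class='item']"):
    name = item.findtext("span[@class='name']")
    price = item.findtext("span[@class='price']")  # None when the field is absent
    aligned.append((name, price))

print(misaligned)
print(aligned)
```

In the page-level version, item B is paired with item C's price and item C is dropped. In the item-relative version, the missing price stays empty in its own row.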

 

 

 

Tips!

· Before clicking the next step, we must make sure the page is fully loaded, i.e., the loading signal has disappeared.

· When we click "Click Item" or "Extract data" inside a loop, we should also select an item in the loop other than the first one. This shows whether the "Click Item" or "Extract data" step works beyond the first item.

 

 

 

 

Step 2: Run the task by Local Extraction

After making sure each step works well by manually clicking, we can run the task by Local Extraction to check if there are any bugs.

We can assume there is a bug when any of the following situations occurs:

 

·  Getting no data extracted

 

When the reminder pops up, we can refer to the tutorial: Why Octoparse stops and no data is extracted?

 

·  Extracting duplicate data

 

When the task keeps producing duplicate data, the problem most likely lies in its "Loop Item". The solution is covered in the article: Why does Octoparse only extract the first item and duplicate?
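To confirm duplication in an exported result before adjusting the loop, a quick check like the following may help. It is a minimal sketch that assumes the data has been exported to CSV; the column names and rows are invented, and an in-memory string stands in for the exported file.

```python
import csv
import io
from collections import Counter

# Stand-in for an exported CSV file; columns and rows are made up.
EXPORT = """title,price
Widget,10
Widget,10
Widget,10
Gadget,25
"""

def duplicate_rows(csv_text: str) -> dict:
    """Return each row that appears more than once, with its count."""
    rows = [tuple(r) for r in csv.reader(io.StringIO(csv_text))][1:]  # skip header
    counts = Counter(rows)
    return {row: n for row, n in counts.items() if n > 1}

print(duplicate_rows(EXPORT))
```

If one row, typically the first item on the page, accounts for most of the output, the loop is probably extracting from the same item on every iteration.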

 

·  Too much data missing from the extraction, which may be caused by:

   1) The "Loop Item" does not cover all the items in the list on each listing page.

      See: How to deal with missing items when creating a list?

   2) The web page does not load completely.

      See: Why I have data missing/no data even I do see it in workflow
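To see which fields are affected, it can help to count the empty cells per column in the exported data. The sketch below assumes a CSV export; the column names and rows are invented, and an in-memory string stands in for the file.

```python
import csv
import io

# Stand-in for an exported CSV file; columns and rows are made up.
EXPORT = """title,price,rating
Widget,10,4.5
Gadget,,4.0
Gizmo,,
"""

def missing_per_column(csv_text: str) -> dict:
    """Count empty cells in each column of the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            if not (row[name] or "").strip():
                missing[name] += 1
    return missing

print(missing_per_column(EXPORT))
```

Blanks concentrated in one column suggest looking at that field first, whereas a row count lower than the number of items on the page suggests checking the "Loop Item".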

 

·  Extracting data at a relatively slow speed

If the local extraction runs slowly, the cause is often the local environment, such as the operating system, hardware capacity, IP address, or network bandwidth. The content of the website also affects the scraping speed. For example, a website that contains a lot of images takes more time to load fully.

However, slow speed can also be a sign of a bug. For example, if we forget to set up an AJAX Timeout for some steps, Octoparse waits 120 seconds by default before proceeding from each of those steps.
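A back-of-the-envelope calculation shows why a missing AJAX Timeout can look like a bug. The 120-second default comes from the text above; the number of items and the 5-second timeout are hypothetical.

```python
def waiting_time_hours(steps: int, wait_seconds: float) -> float:
    """Total time spent waiting, in hours, if every step waits the full duration."""
    return steps * wait_seconds / 3600

ITEMS = 300  # hypothetical number of "Click Item" steps in one run

print(waiting_time_hours(ITEMS, 120))  # default wait described above
print(waiting_time_hours(ITEMS, 5))    # hypothetical explicit AJAX Timeout
```

With these made-up numbers, the default wait alone adds 10 hours to the run, compared with 25 minutes for a 5-second timeout.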

 

 

 

 

 

 

Step 3: Debug in Cloud Extraction (Premium Users)

Before debugging in Cloud Extraction, we must make sure the task already works well both when we manually click each step in the workflow and when we run Local Extraction. Tutorials are available for the situations that occur in Cloud Extraction, including:

·  Data missing when using cloud extraction

If we notice that some data is missing from the results of cloud extraction, we can refer to this tutorial: How to deal with data missing on cloud extraction?

 

·  Getting no data extracted on the cloud

Sometimes the task runs well locally but extracts no data when it runs in the cloud.

In that case, we can refer to the tutorial: Why does cloud extraction get no data while local extraction works perfectly?

 


 

 

Author: Erika F

Editor: Suire M
