
Check The Extraction Rule When Errors Occur

Wednesday, July 20, 2016 10:52 PM


 

Isn't it exciting that you are about to finish your first scraping task? There is just one more thing you should do before running your task: test your workflow step by step to make sure everything works as expected. A test run shows whether you need to adjust your task settings so that the data is captured accurately.

To demonstrate the process, we'll keep using the test site as an example: http://test-sites.octoparse.com/?product_cat=e-commerce-category-1

Test-run workflow steps

Remember that the steps of the workflow should always be read from top to bottom, and from inside to outside for nested steps.

So for our example, we should test the steps in this order:

  1. "Go to Web Page" → test if the web page loads properly
  2. "Pagination" → test if the Next Page button is located correctly
  3. "Click to Paginate" → test if the web page paginates properly
  4. "Loop Item" → test if the list of items is complete and correct
  5. "Extract Data" → test if the data is selected and extracted correctly

 

check the workflow

 

 

Not all tasks are created the same, and you may have a completely different task to test, but the testing methodology generally extends to tasks of all kinds. Let's get started!

1. Click on "Go to Web Page" 

Once you click on the step, it should load the web page in the built-in browser. If the web page loads well, there's nothing to worry about; however, there are a few things you should always pay extra attention to.

1.1 If the web page uses infinite scrolling → you need to select "Scroll down the page after it is loaded" and configure the scroll settings.

 

1.2 If the web page is taking longer than usual to load → you may want to increase the page timeout. Click "General" → "Timeout" to pick an appropriate timeout value.

 

2. Click the "Pagination" box

In order for pagination to work consistently, there are two things we need to check:

  • Whether the Next Page button/arrow is located correctly.
  • Whether pagination works on all pages. For instance, it needs to paginate correctly going from page 1 to page 2, page 2 to page 3, page 3 to page 4, etc.

After you click on the pagination box, go to the highlighted element on the web page and confirm that it is the correct Next Page button. If the wrong element is highlighted, you may need to fix it manually by editing the corresponding XPath.

check the xpath of next page
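
To see why a loose XPath can highlight the wrong element, here is a minimal sketch in Python. The pagination markup, class names, and both XPath expressions are hypothetical and are not taken from the test site. It uses the limited XPath subset in Python's standard library, so it is only an illustration of the idea, not how Octoparse evaluates XPath internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified pagination markup (an assumption, not the real page).
PAGINATION_HTML = """
<nav class="pagination">
  <a class="page-numbers" href="?paged=1">1</a>
  <span class="page-numbers current">2</span>
  <a class="page-numbers" href="?paged=3">3</a>
  <a class="next page-numbers" href="?paged=3">Next</a>
</nav>
""".strip()

# Too loose: selects every link in the pager, so a page-number link
# could be treated as the Next Page button.
LOOSE_XPATH = ".//a"
# Tighter: selects only the link whose class attribute marks it as "next".
NEXT_XPATH = ".//a[@class='next page-numbers']"

root = ET.fromstring(PAGINATION_HTML)
loose_matches = root.findall(LOOSE_XPATH)
next_matches = root.findall(NEXT_XPATH)

print(len(loose_matches))            # how many elements the loose XPath selects
print(len(next_matches))             # should be exactly one Next button
print(next_matches[0].get("href"))   # where that button leads
```

A good Next Page XPath should match exactly one element on every page that has a next page. A full XPath 1.0 engine also supports functions such as contains(), which this standard-library subset does not.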

 

3. Click on "Click to Paginate" 

When you click on "Click to Paginate", you are instructing Octoparse to click on the Next Page button defined in Step 2. If things are working correctly, it should go from page 1 to page 2. Repeat this two-step process (click the "Pagination" box, then click "Click to Paginate") as many times as needed to make sure pagination works correctly on all sequential pages. If the web page does not paginate properly on any of the pages, fix the element XPath in step 2 and test again.
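
The repeated check described above can be pictured as a small simulation. This is only a sketch: the NEXT_PAGE mapping is a made-up model of where each page's Next button leads, and walk_pages and skipped_pages are hypothetical helper names, not Octoparse functions.

```python
# Hypothetical site model: each page number maps to the page its Next button
# leads to. None means the page has no Next button (the last page).
NEXT_PAGE = {1: 2, 2: 3, 3: 4, 4: None}


def walk_pages(next_page, start=1, max_clicks=50):
    """Follow the Next button from `start` and return the pages visited in order."""
    visited = [start]
    current = start
    for _ in range(max_clicks):
        target = next_page.get(current)
        if target is None:
            break  # no Next button: pagination ends normally
        if target in visited:
            # The button leads back to a page already seen, so the task
            # would keep scraping the same page and never stop.
            raise RuntimeError(f"pagination loops back to page {target}")
        visited.append(target)
        current = target
    return visited


def skipped_pages(visited):
    """Return page numbers missing between the first and last visited page."""
    return sorted(set(range(visited[0], visited[-1] + 1)) - set(visited))


order = walk_pages(NEXT_PAGE)
print(order)                 # pages in the order they were reached
print(skipped_pages(order))  # empty when no page was skipped
```

The two failure modes it looks for are the ones worth watching during manual testing: a page that gets skipped, and a Next button that leads back to a page already visited.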

Tips!

Check out these pagination troubleshooting ideas:

1. Dealing with pagination (with a "Next" button)

2. Dealing with pagination (No "Next" button)

3. Dealing with pagination (Infinitive Scroll)

4. Why does Octoparse skip pages during the scrape? (Version 8)

5. Why does Octoparse keep scraping the last page and never stop?

 

4. Click on the "Loop Item" box

Testing the "Loop Item" essentially means confirming that all the desired items have been selected correctly.

Once clicked, go to the web page in the built-in browser and make sure all the items you need are being highlighted.

check the loop items
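
An incomplete list often comes from an XPath that is too specific. The sketch below shows the effect on a made-up product list; the markup, class names, and XPath expressions are assumptions for illustration only and use Python's standard-library XPath subset.

```python
import xml.etree.ElementTree as ET

# Hypothetical product-list markup; the class names are illustrative only.
LISTING_HTML = """
<ul class="products">
  <li class="product"><h2>Item A</h2><span class="price">10.00</span></li>
  <li class="product"><h2>Item B</h2><span class="price">12.50</span></li>
  <li class="product featured"><h2>Item C</h2><span class="price">8.75</span></li>
</ul>
""".strip()

root = ET.fromstring(LISTING_HTML)

# Exact class match: misses the third item because its class attribute differs.
strict_items = root.findall(".//li[@class='product']")
# Structural match: every <li> directly under the list, regardless of class.
all_items = root.findall("./li")

names = [li.findtext("h2") for li in all_items]
print(len(strict_items), len(all_items))
print(names)
```

Comparing the number of matched items with the number of items visible on the page is a quick way to spot a list that is missing entries.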

 

Tips!

If your list is not complete upon testing, you can check out the troubleshooting ideas below:

1. Loop Item

2. What to do if Octoparse does not recognize all the elements in the list? (Version 8)

3. Using Loop with a click, extract and other actions (Ver. 8)

 

5. Click on "Extract Data"

Here is the final step - check if the data is being extracted as needed.

Once clicked, check the data in the preview section and confirm if this is the data that you need. 
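
When the preview has many rows, blank fields are easy to miss by eye. As a rough sketch, the same check can be expressed in a few lines of Python. The function name find_blank_fields, the field names, and the sample rows are all hypothetical.

```python
def find_blank_fields(rows, required):
    """Return (row_index, field) pairs where a required field is missing or empty."""
    problems = []
    for index, row in enumerate(rows):
        for field in required:
            value = row.get(field)
            # Treat missing values and whitespace-only strings as blank.
            if value is None or not str(value).strip():
                problems.append((index, field))
    return problems


# Hypothetical preview rows; the field names are placeholders.
preview_rows = [
    {"title": "Item A", "price": "10.00"},
    {"title": "Item B", "price": ""},
    {"title": "   ", "price": "8.75"},
]

blank = find_blank_fields(preview_rows, ["title", "price"])
print(blank)  # row index and field name of every blank cell
```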

Tips!

If you see any blank fields or if you find misplaced data, you can check out these troubleshooting ideas:

1. How to fix field issues? (Missing, blank, misplaced fields )(Version 8)

2. Locate and scrape an element via nearby text

  

Perform a test run 

After you have gone through each step in the task workflow, it is the perfect time to perform a test run on your local device. Click "Run" and select "Run task on your device".  

 

Now watch your data get extracted live!
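
Once the test run finishes, one simple sanity check on the extracted rows is to look for exact duplicates. The following is a minimal sketch with made-up rows and a hypothetical helper named duplicate_rows; it assumes every row is a flat dictionary of strings.

```python
from collections import Counter


def duplicate_rows(rows):
    """Return each row that appears more than once, mapped to its count."""
    # Sort the items so that key order inside a row does not matter.
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical extracted rows; the first and third are identical.
extracted = [
    {"title": "Item A", "price": "10.00"},
    {"title": "Item B", "price": "12.50"},
    {"title": "Item A", "price": "10.00"},
]

dupes = duplicate_rows(extracted)
print(dupes)
```

Many duplicates in a test run usually point back to the pagination or loop settings rather than to the extraction step itself.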

Tips!

If you are not getting the data you need, check out the FAQs below.

1. Why does Octoparse stop with no data extracted?

2. Why does Octoparse only click the first item in a Loop and stop? (Version 8)

3. Octoparse managed to get data from the first page but stops going to the rest of the pages?

4. Why does Octoparse stop after clicking "Next"?

5. Why do I get so many duplicates? (Version 8)

If none of these solves the problem, you can contact us for assistance.

 

Now that you know your task is working correctly, it's time to get data for real!

 

Happy Data Hunting!

Author: The Octoparse Team
