Why does Octoparse keep scraping the last page and never stop?

Thursday, August 16, 2018

Extracting data from multiple pages through pagination is a very common case since most of the time you need more than one page of data for your project.

But you might find that Octoparse sometimes keeps scraping the last page and never stops. This is usually caused by an endless pagination loop: the "Next" button can still be detected on the last page, so Octoparse keeps clicking it. Here are two ways to fix the issue: set up an "End Loop When" condition under "Advanced Options" of the pagination loop item, or modify the XPath so that the "Next" button is no longer matched on the last page.

1. Set up a loop ending condition - "End Loop When"

The "End Loop When" option allows you to stop the pagination at a specific page. When you set up the "End Loop When" condition for the pagination step, the pagination loop stops once it reaches the configured number of executions.

(For example, to scrape the first 50 pages of data, set the number of executions to 49.)

So if you know exactly how many pagination clicks are needed, the "End Loop When" option resolves the problem.

First select the pagination loop, then open the "End loop when" drop-down menu. Click "Execution time reach", pick a number as your loop execution times and click "OK" to save your configuration.
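The relationship between pages and executions can be checked with a small simulation. This is only a sketch of the idea, not Octoparse's internal logic: the function name and parameters are hypothetical, and it assumes that the first page is loaded before any click and that each loop execution is one click on "Next" followed by scraping the page that appears.

```python
def scrape_with_limit(total_pages, max_executions):
    """Model a pagination loop whose "Next" button never disappears.

    Hypothetical model: the first page is scraped before any click, and
    each loop execution clicks "Next" once and scrapes the resulting page.
    """
    current_page = 1
    scraped = [current_page]
    executions = 0
    while executions < max_executions:
        # On the last page, clicking "Next" just lands on the same page again.
        current_page = min(current_page + 1, total_pages)
        scraped.append(current_page)
        executions += 1
    return scraped


# 49 executions after the first page give exactly 50 pages.
pages = scrape_with_limit(total_pages=200, max_executions=49)
print(len(pages), pages[-1])

# A limit larger than the site needs re-scrapes the last page.
overshoot = scrape_with_limit(total_pages=50, max_executions=52)
print(overshoot[-4:])
```

Under these assumptions, a limit that is too high does not loop forever, but it does produce duplicate rows from the last page, which is why the exact number of clicks matters.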

 

2. Modify XPath

If the issue cannot be resolved by setting up the loop ending condition, you may need to modify the XPath of the pagination loop.

Here we use an example to show you how to end the loop by XPath modification. 

In the two screenshots below, you can see the "Next" button located by an XPath auto-generated by the Firebug plugin, on the first page and on the last page.

(We suggest using the Firebug plugin in the Firefox browser. Firebug is now only available for old versions of Firefox (e.g. Version 54); click here to get the old versions of Firefox.)

 

On the first page:

On the last page:

 

The "class" attribute of the "a" tag on the first page differs from the one on the last page: one is "gspr next", while the other is "gspr next-d".

We can use this difference to write a new XPath that locates the "Next" button on every page except the last one, so that no "Next" button is found on the last page and the loop ends.

So the new XPath should be //a[@class='gspr next'].

Paste the new XPath into Firebug to verify that it locates the "Next" button on the first page but not on the last page.

On the first page:

On the last page:
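The same check can be reproduced outside the browser. The sketch below uses Python's standard xml.etree.ElementTree module, which supports a limited subset of XPath. Only the two class values come from the example above; the surrounding markup and the href values are hypothetical stand-ins for the real page. ElementTree needs a relative path when searching from an element, so the expression is written as .//a[...] rather than //a[...].

```python
import xml.etree.ElementTree as ET

# Relative form of //a[@class='gspr next'], as ElementTree requires.
NEXT_XPATH = ".//a[@class='gspr next']"

# Hypothetical pagination markup; only the class values are from the tutorial.
FIRST_PAGE = """<div>
  <a class="gspr next" href="?page=2">Next</a>
</div>"""

LAST_PAGE = """<div>
  <a class="gspr next-d" href="#">Next</a>
</div>"""


def find_next_buttons(markup):
    """Return the <a> elements whose class is exactly 'gspr next'."""
    root = ET.fromstring(markup)
    return root.findall(NEXT_XPATH)


first_matches = find_next_buttons(FIRST_PAGE)
last_matches = find_next_buttons(LAST_PAGE)
print(len(first_matches))  # one match: the loop can continue
print(len(last_matches))   # no match: the loop has nothing to click
```

Because the predicate compares the whole attribute value, "gspr next-d" does not match, so the loop finds nothing to click on the last page.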

 

 

If you would like to learn more about XPath modification, check out our XPath tutorial:

https://www.octoparse.com/tutorial-7/xpath/

