Why does Octoparse skip some pages?
Wednesday, November 24, 2021
The latest version of this tutorial is available here.
Many users have run into cases where Octoparse skips some pages when scraping a website. For example, after it successfully scrapes the first two pages, it jumps directly to page 5, then maybe to page 10, instead of visiting the pages in sequence.
This happens because the auto-generated XPath of the pagination loop does not always locate the next-page button on every page.
Have a look at the following example: (Example URL)
On the first page, the pagination loop XPath locates the Next button correctly.
However, on the second page, the same XPath locates the link to page 10.
So after it finishes scraping the second page, Octoparse goes directly to page 10 and misses all the data on the pages in between.
How to solve the page-skipping issue?
The fix is simple: modify the XPath so that it always locates the Next button.
First, inspect the Next button in Firefox to check its source code:
The <a> tag has a title attribute. We can use this attribute to write the XPath: //a[@title='Next'] (Check out how to write an XPath here)
Enter the XPath into Octoparse and check that it locates the Next button on every page.
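To make the difference concrete, here is a minimal Python sketch that compares a position-based XPath with the attribute-based one on two pages. The pagination markup and the positional XPath `./a[6]` are assumptions made up for illustration; the actual XPath Octoparse generates and the real page's HTML may look different. The sketch uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and expects a relative path, so `//a[@title='Next']` is written as `.//a[@title='Next']`.

```python
import xml.etree.ElementTree as ET

# Hypothetical pagination bar on page 1; the real page's HTML will differ.
PAGE_1 = (
    '<div class="pager">'
    '<span>1</span>'
    '<a href="?p=2">2</a><a href="?p=3">3</a>'
    '<a href="?p=4">4</a><a href="?p=5">5</a>'
    '<a href="?p=10">10</a>'
    '<a href="?p=2" title="Next">Next</a>'
    '</div>'
)

# On page 2 the bar gains a Prev link, so Next moves from the 6th to the 7th <a>.
PAGE_2 = (
    '<div class="pager">'
    '<a href="?p=1" title="Previous">Prev</a>'
    '<a href="?p=1">1</a>'
    '<span>2</span>'
    '<a href="?p=3">3</a><a href="?p=4">4</a><a href="?p=5">5</a>'
    '<a href="?p=10">10</a>'
    '<a href="?p=3" title="Next">Next</a>'
    '</div>'
)

POSITIONAL = "./a[6]"              # assumed stand-in for an auto-generated XPath
BY_TITLE = ".//a[@title='Next']"   # the attribute-based XPath from this tutorial


def locate(page_html, xpath):
    """Return the href of the first element matching xpath, or None."""
    match = ET.fromstring(page_html).find(xpath)
    return None if match is None else match.get("href")


for name, page in (("page 1", PAGE_1), ("page 2", PAGE_2)):
    print(name, "positional ->", locate(page, POSITIONAL))
    print(name, "by title   ->", locate(page, BY_TITLE))
```

In this made-up markup, the positional XPath lands on the Next link on page 1 but on the link to page 10 on page 2, while the title-based XPath follows the Next link on both pages.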
After creating a pagination loop in a task, it is a good idea to manually click the "Click to paginate" action and step through several pages, as this tutorial shows, to check that the auto-generated XPath locates the Next button precisely.
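The same page-by-page check can be sketched outside Octoparse. The Python snippet below is only an illustration under assumptions: `pager` builds a hypothetical pagination bar (not the markup of any real site), and `first_skip` is a made-up helper that follows an XPath from page 1 and reports the first jump that is not to the very next page.

```python
import xml.etree.ElementTree as ET


def pager(current, last):
    """Build a hypothetical pagination bar for page `current` of `last`."""
    parts = ['<div class="pager">']
    if current > 1:
        parts.append(f'<a href="{current - 1}" title="Previous">Prev</a>')
    for n in range(1, last + 1):
        if n == current:
            parts.append(f"<span>{n}</span>")      # current page is not a link
        else:
            parts.append(f'<a href="{n}">{n}</a>')
    if current < last:
        parts.append(f'<a href="{current + 1}" title="Next">Next</a>')
    parts.append("</div>")
    return "".join(parts)


def first_skip(xpath, last):
    """Follow `xpath` from page 1; return (page, target) of the first jump
    that is not to the following page, or None if every page is visited."""
    current = 1
    while current < last:
        link = ET.fromstring(pager(current, last)).find(xpath)
        if link is None:
            return (current, None)                 # XPath matched nothing
        target = int(link.get("href"))
        if target != current + 1:
            return (current, target)               # pages were skipped
        current = target
    return None


print(first_skip(".//a[@title='Next']", 5))  # attribute-based XPath
print(first_skip("./a[5]", 5))               # assumed position-based XPath
```

With this invented markup, the title-based XPath walks pages 1 through 5 in order, whereas the positional one jumps from page 2 straight to page 5, which is the kind of skip that clicking through several pages by hand is meant to catch.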