Extract multiple pages through pagination
Thursday, June 9, 2022
Note: New tutorials for Octoparse 8.4 are now available; you are reading the 7.3 version.
The World Wide Web is full of paginated content. Take any e-commerce site as an example: rather than listing all products on a single page, it is more likely to spread them across multiple pages, so we are faced with pagination. To scrape product data from such a site, you need to configure your task with pagination so that it includes the products from every page.
This tutorial provides instructions for dealing with 2 types of pagination:
1) Extract data from multiple pages with a "Next" button
- Load the list page/search result page in the built-in browser if you are not already there
- After the page is loaded, locate and click on the "Next" button
- Select "Loop click next page" from "Action Tips"
Switch to the workflow mode by toggling the icon at the upper right-hand side. You'll notice that a "Click to paginate" step has been added to the workflow.
(To finish setting up the task, learn how to capture items on the list and how to capture data from each item page.)
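To make the idea behind "Loop click next page" concrete, here is a minimal Python sketch of the same loop: collect the items on the current page, follow the "Next" link, and stop when there is none. The PAGES dictionary, its URLs, and the scrape_all helper are hypothetical stand-ins for a real site; this is not how Octoparse is implemented internally.

```python
# Hypothetical site: each page lists some items and points to the next page.
PAGES = {
    "/products?page=1": {"items": ["A", "B"], "next": "/products?page=2"},
    "/products?page=2": {"items": ["C", "D"], "next": "/products?page=3"},
    "/products?page=3": {"items": ["E"], "next": None},  # last page: no "Next" button
}


def scrape_all(start_url, pages):
    """Collect items page by page until there is no "Next" link left."""
    collected = []
    seen = set()  # guards against a "Next" link that points back to an earlier page
    url = start_url
    while url is not None and url not in seen:
        seen.add(url)
        page = pages[url]
        collected.extend(page["items"])  # capture the items on this page
        url = page["next"]               # "click" the Next button
    return collected


print(scrape_all("/products?page=1", PAGES))  # ['A', 'B', 'C', 'D', 'E']
```

The loop ends on its own because the last page has no "Next" target, which is the same condition that stops the "Click to paginate" step.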
If the paginated content is loaded dynamically via AJAX, set an AJAX timeout of 2 to 4 seconds for the "Click to paginate" step. Do not set an AJAX timeout if the page does not use AJAX.
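Why a timeout at all? With AJAX, the click does not trigger a full page load, so a scraper has to wait a bounded amount of time for the new content to appear. The Python sketch below is a generic illustration of that kind of timeout-bounded wait. The wait_for_change function and its parameters are hypothetical, and this tutorial does not describe how Octoparse handles the timeout internally.

```python
import time


def wait_for_change(read_content, previous, timeout_seconds=3.0, poll_interval=0.1):
    """Poll read_content() until it differs from `previous` or the timeout expires.

    Returns the new content, or None if nothing changed within the timeout.
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        current = read_content()
        if current != previous:
            return current
        time.sleep(poll_interval)
    return None


# Simulated page whose content switches to page 2 on the third read.
responses = iter(["page 1", "page 1", "page 2"])
result = wait_for_change(lambda: next(responses), "page 1",
                         timeout_seconds=2.0, poll_interval=0.01)
print(result)  # page 2
```

A timeout that is too short risks moving on before the new content arrives, while one that is too long only slows the task down, which is why a few seconds is a reasonable starting point.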
2) Extract data from multiple pages when there is no "Next" button (Page number links only)
Sometimes a "Next" button is not available and we have to click on numbers to paginate:
In this case, we need to modify the XPath of the "Click to paginate" action in the workflow. We'll first add a pagination loop using page number "1", although the loop will not work properly without further adjustment.
- Click on page number "1"
- From the "Action Tips", select "Loop click the selected link" to create a pagination "Loop Item". (Learn more about using XPath in Octoparse.)
The auto-generated pagination loop will not work properly here since we've selected page number "1" to loop through. With the current setup, Octoparse will simply keep clicking on "1" as it tries to paginate to the next page, leading to duplicated data being extracted endlessly.
Now we need to modify the XPath of the "Click to paginate" action, which is the most important part of dealing with page-number pagination.
The XPath axis most often used here is "following-sibling", which selects all the siblings after the current node.
For example, when we are on page 1, our goal is to click on page number "2" to get us to page 2, then subsequently page 3, so on and so forth.
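The difference between always clicking "1" and clicking the page after the current one can be sketched in a few lines of Python. Everything here (the function names and the max_clicks safety cap) is a hypothetical simulation for illustration, not Octoparse's own logic.

```python
def page_numbers_visited(total_pages, pick_target, max_clicks=10):
    """Simulate a pagination loop and return the page numbers it lands on."""
    current = 1
    visited = [current]
    for _ in range(max_clicks):  # safety cap so a broken loop cannot run forever
        target = pick_target(current, total_pages)
        if target is None:       # nothing left to click: the loop ends
            break
        current = target
        visited.append(current)
    return visited


def click_fixed_page_one(current, total_pages):
    """Auto-generated behaviour: the loop is bound to the link for page 1."""
    return 1


def click_following_sibling(current, total_pages):
    """Desired behaviour: click the page number right after the current one."""
    return current + 1 if current < total_pages else None


print(page_numbers_visited(3, click_fixed_page_one, max_clicks=4))  # [1, 1, 1, 1, 1]
print(page_numbers_visited(3, click_following_sibling))             # [1, 2, 3]
```

The first strategy keeps landing on page 1 until the cap stops it, which corresponds to the endless duplicated data. The second stops by itself once the last page has no later sibling.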
1) To do this, write the XPath to locate the selected-page item first
Inspect the source code and locate the code for the currently selected page (this can often be done by right-clicking on the page number "1" and then selecting "Inspect Source Code" or a similar command). In this example, the code for the node of page 1 is: <li class="nav-pageitem selected">.
Thus, given that markup, the XPath of the selected-page item would be:
//li[@class="nav-pageitem selected"]
2) Select the 2nd-page node with the XPath axis "following-sibling"
As the 2nd page is found within the first "li" tag following the current "li" node, the correct XPath would be:
//li[@class="nav-pageitem selected"]/following-sibling::li[1]
3) To click on the link, we need to locate the "a" tag, which is the anchor that links to the 2nd page.
Now we have the complete XPath:
//li[@class="nav-pageitem selected"]/following-sibling::li[1]/a
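To see what an XPath of the form //li[@class="nav-pageitem selected"]/following-sibling::li[1]/a selects, the Python sketch below walks a small sample list by hand. The sample markup, its href values, and the next_page_link helper are assumptions made for illustration; only the <li class="nav-pageitem selected"> node comes from the example above. Python's standard xml.etree.ElementTree module does not support the following-sibling axis, so the sketch emulates it with plain iteration rather than evaluating the XPath itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical pagination markup built around the selected <li> from the example.
PAGINATION_HTML = """
<ul>
  <li class="nav-pageitem selected"><a href="?page=1">1</a></li>
  <li class="nav-pageitem"><a href="?page=2">2</a></li>
  <li class="nav-pageitem"><a href="?page=3">3</a></li>
</ul>
"""


def next_page_link(root):
    """Emulate: selected <li> -> first following <li> sibling -> its <a> child."""
    for parent in root.iter():
        children = list(parent)
        for index, child in enumerate(children):
            if child.tag == "li" and child.get("class") == "nav-pageitem selected":
                # following-sibling::li[1]: first later <li> under the same parent
                for sibling in children[index + 1:]:
                    if sibling.tag == "li":
                        return sibling.find("a")  # /a: the anchor inside that <li>
                return None  # selected page is the last one
    return None  # no selected page found


root = ET.fromstring(PAGINATION_HTML)
link = next_page_link(root)
print(link.text, link.get("href"))  # 2 ?page=2
```

Because the selection is always relative to whichever item is currently marked as selected, the same expression points to page 3 once page 2 is selected, and to nothing on the last page.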
4) Replace the auto-generated XPath for the pagination loop with the new XPath
- Click on the pagination loop, refer to the settings on the right side, and enter the new XPath into the "Single element" textbox
5) Double-check the XPath to make sure it works for other pages
- Click on the pagination loop
- Click on the "Click to paginate" action
- Check whether the webpage has paginated to the subsequent page successfully
- Repeat the above step for more pages
XPath can locate any particular item on a web page, but it is based on the page's source code. Hence the XPath in this example will most likely not apply to other websites, but you can always apply the same method to write an XPath that works for your target website.