Step-by-step tutorials for you to get started with web scraping

Extract multiple pages through pagination

Thursday, August 16, 2018

Paginated content is everywhere. Take your favorite e-commerce site, for example: rather than listing every product on a single page, it most likely spreads them across multiple pages, and that is pagination. So if you want to scrape product data from the site, you need to configure your task with pagination to include the products listed on all of the pages.

 

This tutorial covers two common ways to deal with pagination:

1) Extract multiple pages using the "Next" button

2) Extract data from multiple pages when there is no "Next" button (Page number links only)

 

 

 

 

 

1) Extract multiple pages using the "Next" button

(Example URL: https://www.yelp.com/search?cflt=hotels&find_loc=San+Francisco%2C+CA )

  • Load the list page/search result page in the built-in browser if you are not already there
  • When the page is loaded, locate and click on the "Next" button
  • From "Action Tips", select "Loop click next page"

 

Switch to workflow mode by toggling the icon in the upper right-hand corner. Notice that a "Click to paginate" step has been automatically generated and added to the workflow.
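
Conceptually, the "Click to paginate" loop extracts the current page, clicks "Next", and repeats until no "Next" button remains. The Python sketch below illustrates that logic only. It is not Octoparse's implementation, and the `PAGES` data and the `scrape_all` function are hypothetical stand-ins for a real site.

```python
# Hypothetical in-memory "site": each page holds some items and the key of
# the page its "Next" button leads to. This is illustrative data only.
PAGES = {
    "page1": {"items": ["Hotel A", "Hotel B"], "next": "page2"},
    "page2": {"items": ["Hotel C", "Hotel D"], "next": "page3"},
    "page3": {"items": ["Hotel E"], "next": None},  # last page: no "Next" button
}


def scrape_all(start, pages):
    """Collect items page by page, following "Next" until it is missing."""
    collected = []
    visited = set()
    current = start
    # Stop when there is no "Next" target or when a page would repeat.
    while current is not None and current not in visited:
        visited.add(current)
        page = pages[current]
        collected.extend(page["items"])  # extract data from the current page
        current = page["next"]           # "click" the Next button
    return collected


all_items = scrape_all("page1", PAGES)
print(all_items)
```

The `visited` set is a small safeguard in this sketch: if a "Next" target ever pointed back to a page already seen, the loop would stop instead of extracting duplicates.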

 

(To finish setting up the task, see "Getting data - Capture a list of items" and "Click on a list and capture data from each item page" under Related articles.)

 

Tips!

If the paginated content is loaded dynamically via AJAX, set an AJAX timeout of 2 to 4 seconds for the "Click to paginate" step. Do not set an AJAX timeout if the site does not use AJAX.

What is AJAX and how to deal with it 

 

 

 

 

 

2) Extract data from multiple pages when there is no "Next" button (Page number links only)

Sometimes a site has no "Next" button and offers only page number links.

 

In this case, we need to modify the XPath of the "Click to paginate" action in the workflow. First, add a pagination loop by clicking on page number "1". The loop will not work properly until the XPath is adjusted.

(example URL: http://www.enzolifesciences.com/product-listing/?product_type=Antibodies&application=&text= )

 

Tips!

The auto-generated pagination loop will not work properly here because we selected page number "1" to loop through. With this setup, Octoparse simply keeps clicking on "1" each time it tries to move to the next page, so the same data is extracted over and over.

 

Now we need to modify the XPath of the "Click to paginate" action, which is the most important part of dealing with page-number pagination.

The XPath axis most often used here is "following-sibling", which selects all siblings after the current node.

For example, when we are on page 1, our goal is to click on page number "2" to get to page 2, then on "3" to get to page 3, and so on.

 

1) To do this, first write the XPath that locates the selected-page item

Inspect the source code and locate the code for the currently selected page (this can often be done by right-clicking on the page number "1" and then selecting "Inspect Source Code" or a similar command). In this example, the code for the page 1 node is: <li class="nav-pageitem selected">.

Thus, the XPath of the selected-page item would be:

.//*[@class="nav-pageitem selected"]

 

2) Select the 2nd-page node with the XPath axis "following-sibling"

As the 2nd page is found within the first "li" tag following the current "li" node, the correct XPath would be:

.//*[@class="nav-pageitem selected"]/following-sibling::li[1]

3) To click on the link, we need to locate the "a" tag inside that "li" node, which is the anchor linking to the 2nd page.

Now we have the complete XPath:

.//*[@class="nav-pageitem selected"]/following-sibling::li[1]/a
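
To see what this XPath selects, here is a minimal Python sketch. It rests on two assumptions: the `PAGINATION` markup is a hypothetical reconstruction of the page-number list, and because Python's standard `xml.etree.ElementTree` module does not support the `following-sibling` axis, the helper `next_page_link` (a name made up for this example) emulates that step by walking the siblings by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical pagination markup modeled on the example; the real page's
# HTML may differ. Page 1 is the currently selected page.
PAGINATION = """
<ul>
  <li class="nav-pageitem selected"><a href="?page=1">1</a></li>
  <li class="nav-pageitem"><a href="?page=2">2</a></li>
  <li class="nav-pageitem"><a href="?page=3">3</a></li>
</ul>
"""


def next_page_link(root):
    """Emulate .//*[@class="nav-pageitem selected"]/following-sibling::li[1]/a."""
    for parent in root.iter():
        children = list(parent)
        for index, child in enumerate(children):
            if child.get("class") == "nav-pageitem selected":
                # following-sibling::li[1]: the first later sibling that is an <li>
                for sibling in children[index + 1:]:
                    if sibling.tag == "li":
                        return sibling.find("a")  # the trailing /a step
                return None  # the selected item is the last page
    return None


root = ET.fromstring(PAGINATION)
link = next_page_link(root)
print(link.get("href"), link.text)
```

Because the lookup is anchored on whichever item carries the "selected" class, the same expression keeps pointing at the next page as the selection moves forward.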

 

4) Replace the auto-generated XPath for the pagination loop with the new XPath 

  • Click on the pagination loop, go to the settings on the right side, and enter the new XPath into the "Single element" textbox

 

5) Double-check the XPath to make sure it works for other pages

  • Click on the pagination loop
  • Click on the "Click to paginate" action 
  • Check whether the webpage has paginated to the next page successfully
  • Repeat the above steps for more pages
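
The check above can also be pictured as a loop: starting from page 1, keep following the link after the selected item and confirm that every page is reached exactly once. The Python sketch below simulates this under stated assumptions. `render_pagination` and `next_page` are hypothetical helpers, and the generated markup only imitates the class names from the example.

```python
import xml.etree.ElementTree as ET


def render_pagination(current, total):
    """Build hypothetical pagination markup with `current` marked as selected."""
    items = []
    for number in range(1, total + 1):
        css = "nav-pageitem selected" if number == current else "nav-pageitem"
        items.append(f'<li class="{css}"><a href="{number}">{number}</a></li>')
    return "<ul>" + "".join(items) + "</ul>"


def next_page(markup):
    """Return the page number linked right after the selected item, or None."""
    root = ET.fromstring(markup)
    items = root.findall("li")
    for index, item in enumerate(items):
        if item.get("class") == "nav-pageitem selected":
            if index + 1 < len(items):
                return int(items[index + 1].find("a").get("href"))
            return None  # the selected item is the last page
    return None


visited = []
page = 1
while page is not None:
    visited.append(page)
    page = next_page(render_pagination(page, total=4))

print(visited)
```

If the selector were fixed on page number "1" instead, `next_page` would never advance, which is the endless-duplicate problem described earlier.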

 

 

Tips!

XPath can locate any item on a web page, but it is based on that page's source code. The XPath provided in this example will therefore most likely not work on other websites, but you can apply the same method to write an XPath that works for your target website.

 

 

Related articles:

Locate elements with XPath 

Load with infinite scrolling/Load more 

Lesson 6: Pagination - Capture data from multiple pages 

Click on a list and capture data from each item page 

Getting data - Capture a list of items 

Deal with AJAX 

  
