

Extract multiple pages through pagination

Monday, December 27, 2021

Psst! New tutorials for Octoparse 8.4 are now available; you are reading the 7.3 version.

 

The World Wide Web is full of paginated content. Take any e-commerce site as an example: rather than listing all the products on one single page, the site is more likely to spread them across multiple pages, which is what we call pagination. To scrape product data from such a site, you need to configure your task with pagination so that it includes the products from every page.

 

This tutorial aims to provide instructions on dealing with 2 types of pagination:

1) Extract data from multiple pages with a "Next" button

2) Extract data from multiple pages with Page number links (No "Next" button)

 

 

1) Extract data from multiple pages with a "Next" button

(Example URL: https://www.yelp.com/search?cflt=hotels&find_loc=San+Francisco%2C+CA )

  • Load the list page/search result page in the built-in browser if you are not already there
  • After the page is loaded, locate and click on the "Next" button
  • Select "Loop click next page" from "Action Tips"

 

Switch to workflow mode by toggling the icon at the upper right-hand side. You'll notice that a "Click to paginate" step has been added to the workflow.
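The logic behind a "Next" button loop can be sketched in plain Python. This is only an illustration of the idea, not how Octoparse implements it: the `PAGES` dictionary and the function name are hypothetical stand-ins for a real site, and each page simply records its items and which page its "Next" button leads to.

```python
# Hypothetical in-memory "site": each page lists its items and the page
# that its "Next" button leads to (None when there is no "Next" button).
PAGES = {
    "page1": {"items": ["Hotel A", "Hotel B"], "next": "page2"},
    "page2": {"items": ["Hotel C", "Hotel D"], "next": "page3"},
    "page3": {"items": ["Hotel E"], "next": None},
}


def scrape_with_next_button(pages, start):
    """Collect items from every page by following "Next" until it is gone."""
    collected = []
    visited = set()
    current = start
    # Stop when there is no "Next" target, or when a page repeats
    # (a guard against looping forever on a badly configured locator).
    while current is not None and current not in visited:
        visited.add(current)
        page = pages[current]
        collected.extend(page["items"])  # extract data from the current page
        current = page["next"]           # "click" the Next button
    return collected


all_items = scrape_with_next_button(PAGES, "page1")
print(all_items)
```

The loop ends on its own once the last page is reached, because that page has no "Next" target to follow.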

 

(To finish setting up the task, learn how to capture items on the list and capture data from each item page by clicking on a list.)

 

Tips!

If the paginated content is loaded dynamically via AJAX, set an AJAX timeout of 2 to 4 seconds for the "Click to paginate" step. Do not set an AJAX timeout if the page does not use AJAX.

What is AJAX and how to deal with it 

 

 

2) Extract data from multiple pages when there is no "Next" button (Page number links only)

Sometimes a "Next" button is not available and we have to click on numbers to paginate: 

 

In this case, we need to modify the XPath of the "Click to paginate" action in the workflow. We'll first add a pagination loop using page number "1", although the loop will not work properly without further adjustment.

(example URL: https://www.meetup.com/en-AU/The-Entrepreneur-Club-Melbourne/members/?_cookie-check=jgrO7x5nYzelIGAh)

 

Tips!

The auto-generated pagination loop will not work properly here since we've selected page number "1" to loop through. With the current setup, Octoparse will simply keep clicking on "1" as it tries to paginate to the next page, leading to duplicated data being extracted endlessly. 
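To see why a fixed locator gets stuck, here is a minimal simulation. Everything in it is hypothetical (the function names, the three-page site, and the click limit); it only contrasts a locator that always targets page "1" with one that targets the page after the current one.

```python
TOTAL_PAGES = 3  # hypothetical size of the paginated site


def paginate(pick_next, max_clicks=5):
    """Start on page 1 and keep "clicking" whatever pick_next returns."""
    visited = [1]
    current = 1
    for _ in range(max_clicks):  # safety limit so the demo always ends
        target = pick_next(current)
        if target is None:       # nothing left to click: stop
            break
        current = target
        visited.append(current)
    return visited


def click_page_one(current):
    """Fixed locator: always points at the link labelled "1"."""
    return 1


def click_following_page(current):
    """Relative locator: points at the page right after the current one."""
    return current + 1 if current < TOTAL_PAGES else None


stuck = paginate(click_page_one)
fixed = paginate(click_following_page)
print(stuck)
print(fixed)
```

The fixed locator revisits page 1 until the click limit is hit, while the relative locator walks through each page once and then stops.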

 

Now we need to modify the XPath of the "Click to paginate" action, which is the most important part of dealing with page number pagination.

The XPath axis most often used here is "following-sibling", which selects all the siblings after the current node.

For example, when we are on page 1, our goal is to click on page number "2" to get to page 2, then on "3" to get to page 3, and so on.

 

1) To do this, write the XPath to locate the selected-page item first

Inspect the source code and locate the code for the currently selected page (this can often be done by right-clicking on the page number "1" and then selecting "Inspect Source Code" or a similar command). In this example, the code for the node of page 1 is: <li class="nav-pageitem selected">.

Thus, the XPath of the selected-page item would be:

.//*[@class="nav-pageitem selected"]
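As a quick way to try this expression outside Octoparse, the sketch below runs it with Python's standard `xml.etree.ElementTree`, which supports this kind of attribute predicate. The markup is a simplified, hypothetical version of the pagination bar (only the `li` class comes from the example above), and it assumes well-formed markup, which real pages do not always provide.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified pagination markup modelled on the example node.
SAMPLE = """
<ul class="nav-pagination">
  <li class="nav-pageitem selected"><a href="?page=1">1</a></li>
  <li class="nav-pageitem"><a href="?page=2">2</a></li>
  <li class="nav-pageitem"><a href="?page=3">3</a></li>
</ul>
"""

root = ET.fromstring(SAMPLE)

# Same expression as in the tutorial: any element whose class attribute
# is exactly "nav-pageitem selected".
selected = root.find('.//*[@class="nav-pageitem selected"]')

selected_tag = selected.tag
selected_label = selected.find("a").text
print(selected_tag, selected_label)
```

Note that `@class="nav-pageitem selected"` compares the whole attribute value, so it only matches when the class string is written exactly this way in the page source.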

 

2) Select the 2nd-page node with XPath Syntax, "following-sibling" 

As the 2nd page is found within the first "li" tag following the current "li" node, the correct XPath would be:

.//*[@class="nav-pageitem selected"]/following-sibling::li[1]

 

3) To click on the link, we need to locate the "a" tag inside that "li" node, which is the anchor that links to the 2nd page.

Now we have the complete XPath:

.//*[@class="nav-pageitem selected"]/following-sibling::li[1]/a
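The sketch below shows what this expression selects, step by step. Python's standard `xml.etree.ElementTree` does not support the `following-sibling` axis, so the code walks the siblings by hand instead; the helper name `next_page_link` and the sample markup are hypothetical, and a full XPath engine would evaluate the expression directly.

```python
import xml.etree.ElementTree as ET

SELECTED_CLASS = "nav-pageitem selected"


def next_page_link(root):
    """Emulate .//*[@class="nav-pageitem selected"]/following-sibling::li[1]/a"""
    for parent in root.iter():
        children = list(parent)
        for index, child in enumerate(children):
            if child.get("class") != SELECTED_CLASS:
                continue
            # following-sibling::li[1] -> first later sibling that is an <li>
            for sibling in children[index + 1:]:
                if sibling.tag == "li":
                    return sibling.find("a")  # /a -> the link inside that <li>
            return None  # selected item is the last <li>: nothing to click
    return None


# Hypothetical markup: page 1 is selected.
FIRST = """<ul>
  <li class="nav-pageitem selected"><a href="?page=1">1</a></li>
  <li class="nav-pageitem"><a href="?page=2">2</a></li>
  <li class="nav-pageitem"><a href="?page=3">3</a></li>
</ul>"""

# Same markup with the last page selected.
LAST = """<ul>
  <li class="nav-pageitem"><a href="?page=1">1</a></li>
  <li class="nav-pageitem"><a href="?page=2">2</a></li>
  <li class="nav-pageitem selected"><a href="?page=3">3</a></li>
</ul>"""

link = next_page_link(ET.fromstring(FIRST))
after_last = next_page_link(ET.fromstring(LAST))
print(link.text, link.get("href"), after_last)
```

Because the expression is relative to the selected item, it finds nothing once the last page is selected, which is what lets a pagination loop come to an end.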

 

4) Replace the auto-generated XPath for the pagination loop with the new XPath 

  • Click on the pagination loop, then in the settings on the right side, enter the new XPath into the textbox for "Single element"

 

5) Double-check the XPath to make sure it works for other pages

  • Click on the pagination loop
  • Click on the "Click to paginate" action 
  • Check whether the webpage has paginated to the subsequent page successfully
  • Repeat the above steps for more pages
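The same check can be rehearsed offline. The sketch below is an assumption-laden simulation, not Octoparse behavior: `render_pagination` builds hypothetical markup in which the "selected" class moves to whichever page is current, and `walk_pages` keeps following the item after the selected one until none is left.

```python
import xml.etree.ElementTree as ET


def render_pagination(current, total):
    """Build hypothetical pagination markup with `current` marked selected."""
    items = []
    for number in range(1, total + 1):
        css = "nav-pageitem selected" if number == current else "nav-pageitem"
        items.append(f'<li class="{css}"><a href="?page={number}">{number}</a></li>')
    return ET.fromstring("<ul>" + "".join(items) + "</ul>")


def next_page_number(root):
    """Return the page number linked from the first <li> after the selected one."""
    children = list(root)
    for index, child in enumerate(children):
        if child.get("class") == "nav-pageitem selected":
            following = [c for c in children[index + 1:] if c.tag == "li"]
            if not following:
                return None  # the last page is selected
            link = following[0].find("a")
            return int(link.text) if link is not None else None
    return None


def walk_pages(total):
    """Start on page 1 and paginate until there is no following page."""
    visited = [1]
    current = 1
    while True:
        target = next_page_number(render_pagination(current, total))
        if target is None:
            break
        current = target
        visited.append(current)
    return visited


visited = walk_pages(4)
print(visited)
```

If the locator works on every page, each page number appears once and in order; a repeated or missing number would point to an XPath that needs another look.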

Tips!

XPath locates items on a web page based on that page's source code. The XPath provided in this example will therefore most likely not work on other websites, but you can always apply the same method to write an XPath that works for your target website.

 

Related articles:

Locate elements with XPath 

Load with infinite scrolling/Load more 

Lesson 6: Pagination - Capture data from multiple pages 

Click on a list and capture data from each item page 

Getting data - Capture a list of items 

Deal with AJAX 

  
