Step-by-step tutorials for you to get started with web scraping
Lesson 6: Pagination - Capture data from multiple pages
Tuesday, December 24, 2019
The latest version of this tutorial is available here. Check it out now!
Now that you've learned how to capture a list of items and how to capture data from each item page, you are ready to extend the scraping to multiple pages. In this lesson, we will show you how to add a pagination action by clicking the "Next" button so that data is extracted from all available pages.
1) Set up pagination for extracting data from the individual item page
Once you’ve created a task for extracting specific data fields from the individual item page, the workflow should have a "Go To Web Page" step and a "Loop Item" step that clicks each item link in turn and captures the designated data fields from each item page.
As the "Next" button is always located on the list page, click the "Go To Web Page" step if you are not already on the list page.
We will use the following URL for the example below: https://www.yelp.com/search?find_desc=Takeout&find_loc=new+york%2C+NY%2C+United+States&ns=1
Create a pagination loop
- Locate the "Next" button and click on it
- On "Action Tips", select "Loop click next page". Notice a "Click to paginate" step is automatically generated and added to the workflow.
- Rearrange the workflow steps by dragging and dropping the "Loop Item" inside the "Pagination" loop, positioned right before the "Click to paginate" step.
1. In what order does Octoparse execute each step?
Octoparse executes steps from the top down. For nested loops, Octoparse finishes the inner "Loop Item" first and then continues with the outer loop.
Let's look at the workflow from the current task as an example. Here's the order in which Octoparse would execute the steps in the workflow:
1 - "Go To Web Page" for loading the target webpage
2 - "Click Item" for clicking the first item
3 - "Extract Data" for capturing data on the first item page
4 - "Loop Item" for repeating the "Click Item" and "Extract Data" for all items from the first list page
5 - "Click to paginate" for clicking on the "Next" button once the scraping is done for the first page
6 - "Pagination" loop for repeating the "Click to paginate" step
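The execution order above can be illustrated with a small Python simulation. This is only a sketch of the nesting logic, not Octoparse's own code: the function name `run_workflow` and the sample item names are hypothetical, and each page is modeled as a plain list of item names.

```python
def run_workflow(pages):
    """Simulate the step order for a list of pages, each a list of item names.

    Returns the trace of executed steps as a list of strings.
    """
    trace = ["Go To Web Page"]
    for page_number, items in enumerate(pages, start=1):  # outer "Pagination" loop
        for item in items:                                # inner "Loop Item"
            trace.append(f"Click Item: {item}")
            trace.append(f"Extract Data: {item}")
        if page_number < len(pages):                      # a "Next" button still exists
            trace.append("Click to paginate")
    return trace


if __name__ == "__main__":
    # Two hypothetical list pages: the first with two items, the second with one.
    for step in run_workflow([["A", "B"], ["C"]]):
        print(step)
```

The inner loop runs to completion for every item on a page before the outer loop clicks "Next", which mirrors why the "Loop Item" has to sit inside the "Pagination" loop and before the "Click to paginate" step.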
Set up an AJAX timeout for the "Click to paginate" step
- Select the "Click to paginate" step
- Select "Load the page with AJAX"
- Select a 2-3 second AJAX timeout
- Click "OK" to save any changes
[Do not set up an AJAX timeout if the item does not use AJAX]
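To see why a timeout matters when a click updates content in place, here is a minimal Python sketch of a bounded wait after a click. It is not how Octoparse implements its AJAX timeout; the helper `click_and_wait` and its parameters `click`, `read_content`, `ajax_timeout` and `poll_interval` are hypothetical names used only for illustration.

```python
import time


def click_and_wait(click, read_content, ajax_timeout=3.0, poll_interval=0.1):
    """Click, then wait up to ajax_timeout seconds for the content to change.

    Returns the new content, or the unchanged content if the timeout expires.
    """
    before = read_content()
    click()
    deadline = time.monotonic() + ajax_timeout
    while time.monotonic() < deadline:
        current = read_content()
        if current != before:
            return current          # the in-place update has arrived
        time.sleep(poll_interval)   # no page reload to wait on, so poll instead
    return read_content()           # give up after the timeout and move on


if __name__ == "__main__":
    # A fake page whose content changes as soon as "Next" is clicked.
    state = {"content": "page 1"}

    def click_next():
        state["content"] = "page 2"

    print(click_and_wait(click_next, lambda: state["content"], ajax_timeout=0.5))
```

Without a bound like `ajax_timeout`, a step that waits for a full page load could stall indefinitely, because an AJAX update never triggers one.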
When should I set up an AJAX timeout?
The AJAX technique is commonly used for elements that need to be clicked, such as "Click to view email" or "Next". In this case, it is critical to set up an AJAX timeout, or the workflow will not execute properly. To tell whether AJAX is used, observe whether the web page updates its content without reloading, i.e., without the browser showing its usual page-loading indicators. If it does, the AJAX technique is very likely used on the item.
2) Set up pagination for extracting a list of items
If your task is set up for capturing a list of items (see how to capture a list of items in Lesson 4), your workflow should consist of a "Go To Web Page" step and a "Loop Item" that loops through each item on the list.
Now, locate the "Next" button and click on it. On "Action Tips", select "Loop click next page" to create the pagination loop.
Rearrange the loops in the workflow if the pagination loop is created below the data extraction loop.
Once the pagination loop is created, the correct workflow should have the "Loop Item" nested inside the "Pagination" loop, placed before the "Click to paginate" step.
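For the list-extraction case, the same loop structure can be sketched in Python. This is a simplified model under stated assumptions, not Octoparse's implementation: the function `scrape_all_pages` is hypothetical, and each page is represented as a dictionary with its `rows` and the id of the `next` page (or `None` when there is no "Next" button).

```python
def scrape_all_pages(pages, start):
    """Collect rows from every list page, following "next" links until none is left.

    `pages` maps a page id to a dict with "rows" (list) and "next" (page id or None).
    """
    rows = []
    visited = set()
    current = start
    while current is not None and current not in visited:  # "Pagination" loop
        visited.add(current)
        for row in pages[current]["rows"]:                  # "Loop Item" + "Extract Data"
            rows.append(row)
        current = pages[current]["next"]                    # "Click to paginate"
    return rows


if __name__ == "__main__":
    # Hypothetical two-page listing.
    demo = {
        "p1": {"rows": ["a", "b"], "next": "p2"},
        "p2": {"rows": ["c"], "next": None},
    }
    print(scrape_all_pages(demo, "p1"))
```

The `visited` set is a defensive choice in this sketch: it stops the loop if a "Next" link ever points back to a page that was already scraped.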