Scrape Data from Websites with Pagination (Query Strings) (1)
Wednesday, September 28, 2016 7:54 AM
What is it?
The pagination action is used when the content you want to scrape spans multiple pages of a website. Octoparse mimics human browsing behavior: just as you would click through to the next page while browsing a website, Octoparse does the same when you use the pagination feature.
Query string pagination is one of the most common ways to flip through pages. The page to display is selected by a query string parameter in an otherwise ordinary URL.
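To make the idea of a query string parameter concrete, here is a minimal Python sketch that builds the URL for a given page number. The parameter name `page`, the `example.com` address and the helper `page_url` are assumptions for illustration only; real sites use whatever parameter name they choose.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_url(base_url: str, page: int, param: str = "page") -> str:
    """Return base_url with the query parameter `param` set to `page`.

    `param` defaults to "page", which is an assumption; check the
    target site's URLs to see which name it actually uses.
    """
    parts = urlsplit(base_url)
    query = dict(parse_qsl(parts.query))  # keep any existing parameters
    query[param] = str(page)              # add or overwrite the page number
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical listing URL; pages 1 to 3 differ only in the query string.
urls = [page_url("http://example.com/topic/news?sort=date", n) for n in range(1, 4)]
for url in urls:
    print(url)
```

Only the value of the page parameter changes from one URL to the next, which is what makes this kind of pagination easy to follow.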
When do you want to use it?
If you need to extract data from more than one page, use pagination to enable page flipping.
There are two main kinds of query string pagination.
The first kind has a “Next page” button, while the second does not.
In this tutorial, I will use securitysystemsnews.com as an example to show you how to scrape data from websites whose pagination has a “Next” button.
(You can download my extraction task for this tutorial HERE, in case you need it.)
Step 1. Start your task and set up basic information.
- Click “Quick Start”
- Choose "New Task (Advanced Mode)"
- Complete the “Basic Information”
- Click “Next”
Step 2. Navigate to your target webpage
- Enter the target URL in the built-in browser
(URL of the example: http://www.securitysystemsnews.com/topic/Commercial-and-Systems-Integrators )
- Click the “Go” icon to open the webpage
Step 3. Set up pagination
To extract data from websites with query string pagination, you need to add a page navigation action for pagination.
- Click on “Next” to the right of the page numbers
- Choose “Loop Click Next Page”.
(This tells Octoparse to open each page in turn so that the extraction actions can run on it.)
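For readers who want to see the idea behind “Loop Click Next Page” in code, the sketch below follows a “Next” link from page to page until no such link is left. It is only an illustration under stated assumptions, not how Octoparse is implemented: the three pages in `PAGES` are made up and held in memory, and `find_next` and `visit_all_pages` are hypothetical helper names.

```python
from html.parser import HTMLParser

# Hypothetical three-page site kept in memory so the sketch runs offline.
PAGES = {
    "/news?page=1": '<a href="/news?page=2">Next</a>',
    "/news?page=2": '<a href="/news?page=3">Next</a>',
    "/news?page=3": "<span>Last page</span>",
}


class NextLinkFinder(HTMLParser):
    """Records the href of the first <a> whose text is exactly 'Next'."""

    def __init__(self):
        super().__init__()
        self.next_href = None
        self._current_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")

    def handle_data(self, data):
        inside_link = self._current_href is not None
        if inside_link and self.next_href is None and data.strip() == "Next":
            self.next_href = self._current_href

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_href = None


def find_next(html):
    """Return the URL behind the 'Next' link, or None if there is none."""
    finder = NextLinkFinder()
    finder.feed(html)
    return finder.next_href


def visit_all_pages(start, max_pages=100):
    """Follow 'Next' links from `start`, stopping when none is left."""
    visited = []
    url = start
    # The `url not in visited` and `max_pages` checks guard against endless loops.
    while url is not None and url not in visited and len(visited) < max_pages:
        visited.append(url)
        url = find_next(PAGES[url])
    return visited


visited = visit_all_pages("/news?page=1")
print(visited)
```

The loop ends on the last page because that page has no “Next” link to follow.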
Step 4. Create a list of items
Move your cursor over the articles that share a similar layout; these are the sections whose content you want to extract.
- Click anywhere on the first section on the web page
(Make sure the outlined box contains the data to be extracted)
- When prompted, click “Create a list of items” (sections with similar layout)
- Click “Add current item to the list”
(Now that the first item has been added to the list, we need to add the remaining items)
- Click “Continue to edit the list”
- Click a second section with similar layout
- Click “Add current item to the list” again
(Now all the sections with a similar layout have been added to the list)
- Click “Finish Creating List”
- Click “loop”. This action tells Octoparse to click on each section in the list to extract the selected data
Step 5. Select the data to be extracted
- Select the data to be extracted
- Right-click on the title of the first section
- Select “Extract text”
- Follow the same steps to extract other data fields
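Steps 4 and 5 together amount to “for every section with the same layout, pull out the same fields”. The following Python sketch shows that pattern on a small, made-up fragment. The markup, the class names `item` and `date`, the sample values and the function `extract_items` are all assumptions for illustration. The fragment is deliberately well-formed so that the standard library XML parser can read it; real web pages usually need a proper HTML parser.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment standing in for a page of similar sections.
PAGE = """
<div>
  <div class="item"><h2>First headline</h2><p class="date">2016-09-01</p></div>
  <div class="item"><h2>Second headline</h2><p class="date">2016-09-02</p></div>
</div>
"""


def extract_items(markup):
    """Return one dict per section, with the same fields for each."""
    root = ET.fromstring(markup)
    rows = []
    # Every <div class="item"> plays the role of one entry in the list of items.
    for item in root.findall(".//div[@class='item']"):
        rows.append({
            "title": item.findtext("h2", default="").strip(),
            "date": item.findtext("p[@class='date']", default="").strip(),
        })
    return rows


rows = extract_items(PAGE)
for row in rows:
    print(row)
```

Each dictionary corresponds to one row of extracted data, and each key corresponds to one data field.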
Step 6. Rename the data fields
All of the selected content is listed under Data Fields.
- Click a “Field Name” to modify it.
Step 7. Adjust the relative loop sequence
- Drag the second “Loop Item” box before the “Click to paginate” action of the “Cycle Pages” box in the Workflow Designer.
(This way, the elements of every section are extracted from each of the pages.)
- Click “Save”
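Why the order of the two actions matters can be seen in a toy model. The sketch below is not Octoparse's engine; it only simulates a page loop over made-up data, where `PAGES` and `run` are hypothetical names. In this model, extracting items before clicking to paginate collects every page, while paginating first skips the items on the first page.

```python
# Made-up items on three pages.
PAGES = [["a1", "a2"], ["b1", "b2"], ["c1"]]


def run(extract_before_paginate):
    """Simulate a page loop with the item loop placed before or after
    the paginate action, and return the items collected."""
    collected = []
    page = 0
    while True:
        if extract_before_paginate:
            collected.extend(PAGES[page])  # item loop runs on the current page
        if page + 1 >= len(PAGES):         # no "Next" button left
            break
        page += 1                          # click to paginate
        if not extract_before_paginate:
            collected.extend(PAGES[page])  # item loop runs only after the click
    return collected


print(run(True))   # item loop placed before the paginate action
print(run(False))  # item loop placed after the paginate action
```

In this simplified model only the first ordering visits every item, which matches the intent of the step above.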
Step 8. Start running your task
- After saving your extraction configuration, click “Next”
- Select “Local Extraction”
- Click “OK” to run the task on your computer.
(Octoparse will automatically extract all the data selected. Check the "Data Extracted" pane for the extraction progress)
Step 9. Check the data and export
- Check the data extracted
(The data extracted with pagination will be shown in the “Data Extracted” pane.)
- Click the "Export" button to export the results to an Excel file, a database or another format, and save the file to your computer
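If you later want to produce a similar export from your own scripts, the sketch below writes rows of extracted fields as CSV, a format that Excel can open. The sample rows, the field names and the helper `to_csv` are made up for illustration, and the output is built in memory rather than saved to disk.

```python
import csv
import io

# Made-up rows standing in for the extracted data fields.
rows = [
    {"title": "First headline", "date": "2016-09-01"},
    {"title": "Second headline", "date": "2016-09-02"},
]


def to_csv(rows, fieldnames):
    """Return the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()    # first line: the field names
    writer.writerows(rows)  # one line per extracted item
    return buffer.getvalue()


csv_text = to_csv(rows, ["title", "date"])
print(csv_text)
```

To save the result as a file, open it with `open(path, "w", newline="")` and pass that file object to `csv.DictWriter` instead of the in-memory buffer.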
Now you've learned how to scrape data from websites with pagination. Let’s look into how pagination works with this example [link to the case study example].
Or, learn more about pagination-related topics:
- Pagination Loop issue: The extraction stops after 3 pages
- Create A Loop For Pagination Manually
- Pagination Scraping: Configure “Loop click next page” When It Can’t Be Detected
Author: The Octoparse Team