If you are seriously looking into scraping a website, chances are you would want to navigate through the different pages of the website and extract data from each one of them. The first step, however, is to identify the kind of pagination you are dealing with and work from there. A few examples are:
- Paginate using the "Next" button
- Paginate without the "Next" button
- Paginate with infinite scrolling
- Paginate using the "Load more" button
In this tutorial, we will focus on how to create a pagination action when there is no next page button on the page. More specifically, one that requires clicking the numbered links when you want to turn the page, like the ones below.
Now, let's explore the various ways you can create a pagination action with no next page button in Octoparse.
1. Create pagination with Auto-detect
If you are building a new task with Webpage Auto-detect, Octoparse automatically scans the web page for web data and pagination links.
If you have "Auto-detect" enabled in Settings, the auto-detect process will be initiated automatically.
If Octoparse detects any pagination links on the web page, pagination options will be provided in the Tips panel upon completion of the Auto-detect Process. You can click "Check" to see the link detected by Octoparse or click "Edit" to edit the link if it has not been detected correctly.
Web pages come in many different forms, so there will be times when Auto-detect fails to detect pagination links or detects the wrong ones. In this case, you can turn to one of the solutions below.
2. Use "Batch Generate" to create URLs for all pages
An alternative but very effective way to approach scraping multiple pages of a website is to first collect the URLs of all the pages you would need to scrape and build a task using the list of URLs collected.
Take a closer look at the URLs of the different pages. Do you notice something like this?
If you are seeing a similar pattern to the example above, with only the page number changing in the URLs of the different pages, you can easily batch generate all the page URLs and scrape as many pages as needed. Once you have the links generated, Octoparse will go on to scrape all the pages automatically.
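To make the idea concrete, here is a minimal Python sketch of what batch generating amounts to: substituting a running page number into a URL pattern. The URL template and the `batch_generate` helper are hypothetical and only illustrate the pattern; this is not how Octoparse implements the feature.

```python
def batch_generate(url_template: str, first: int, last: int, step: int = 1) -> list[str]:
    """Return one URL per page by substituting the page number into the template."""
    return [url_template.format(page=n) for n in range(first, last + 1, step)]


# Hypothetical URL pattern; replace it with the pattern observed on the target site.
TEMPLATE = "https://example.com/products?page={page}"

urls = batch_generate(TEMPLATE, first=1, last=5)
for url in urls:
    print(url)
```

If the site counts by offset rather than by page (for example 0, 20, 40), the same sketch applies with a different `first` and `step`.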
3. Create pagination manually
Even if Auto-detect fails and the page URLs do not follow a pattern, you can still create a pagination action manually.
It is a two-step process. First, you write or find the XPath of the page element that takes you to the next page (e.g., if you are on page 1, you want to click page 2; if you are on page 2, you want to click page 3, and so on). Second, you revise the XPath of the "Pagination" action of the workflow in Octoparse. Sounds complicated? No worries, let's dive into an example.
XPath knowledge is not mandatory but is extremely helpful to create a task that does exactly what you need in Octoparse. Check out What is XPath and how to use it in Octoparse to learn more about how to use XPath to create the perfect web scraper.
You may need this example link to follow through: http://www.enzolifesciences.com/product-listing/?product_type=Antibodies&application=&text=
Step 1. Click the pagination part and click "Loop click single element"
Step 2. Get the right XPath
1) Copy and paste the current page URL (http://www.enzolifesciences.com/product-listing/?product_type=Antibodies&application=&text=) into your own browser (e.g. Chrome). You also need to install a browser add-on called XPath Helper.
2) In your browser, click to launch the XPath Helper.
3) Locate the page numbers on the web page, right-click the page-number link "1" and select the Inspect option.
4) By now, your screen should look like this. The highlighted code corresponds to the link on page 1.
5) Next, right-click the highlighted code, select "Copy", then "Copy XPath". You have now copied the XPath of page-number link "1".
This is the XPath you've copied:
6) Looking at the source code, you can see that the page-2 element is located one line below the page-1 element.
Using the XPath axis "following-sibling", which selects the nodes that come after the current node at the same level, you can modify the copied XPath for the page-1 element into one that locates the page after it (page 2 in this case).
So the XPath that always locates the page following the current page is:
*Note: Appending "/following-sibling::a" to the previous XPath makes it match the link elements (a) that come after the page-1 element at the same level; the first match is the link to the next page.
Enter this XPath in the Query section of the XPath Helper, and you can see that page "2" is correctly located.
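To see what "following-sibling" selects, here is a small Python sketch that mimics it on made-up pagination markup. Everything in it is an assumption for illustration: the HTML snippet, the "current" class marking the active page, and the `next_page_link` helper are hypothetical and are not taken from the example website. Python's standard-library ElementTree does not support the following-sibling axis, so the sketch walks the sibling list by hand. In a full XPath engine the equivalent expression for this markup would be something like //a[@class="current"]/following-sibling::a[1].

```python
import xml.etree.ElementTree as ET

# Hypothetical pagination markup; the "current" class is an assumption.
HTML = """
<div class="pagination">
  <a href="?page=1">1</a>
  <a class="current" href="?page=2">2</a>
  <a href="?page=3">3</a>
</div>
"""


def next_page_link(root):
    """Return the first <a> that follows the current-page element among its siblings."""
    for parent in root.iter():
        children = list(parent)
        for i, child in enumerate(children):
            if child.get("class") == "current":
                # Mimic following-sibling::a[1]: scan only the later siblings.
                for sibling in children[i + 1:]:
                    if sibling.tag == "a":
                        return sibling
                return None  # current page is the last one
    return None  # no current-page marker found


root = ET.fromstring(HTML.strip())
link = next_page_link(root)
print(link.get("href") if link is not None else "no further pages")
```

Because the selection is relative to the current page rather than to a fixed page number, the same expression keeps pointing at the next page on every turn, and it finds nothing once the last page is reached, which is what ends the loop.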
Step 3. Revise the existing XPath with the new XPath
Copy and paste the new XPath into the "Pagination" action, then click "Apply" to confirm.
If you're still having trouble dealing with pagination without the next button, submit a ticket to our support team! We're here to help.
Happy Data Hunting!
Author: The Octoparse Team