Scrape Scientific American with Pagination Issue
Tuesday, March 07, 2017 8:50 PM
In this web scraping tutorial, we will show you how to deal with a pagination issue. A pagination issue occurs when a task does not page through the website properly, so it scrapes the wrong pages or scrapes the same page repeatedly. When this happens, the most likely cause is that the auto-generated XPath for 'Next' is not accurate. Here we'll show how to resolve it, using data from Scientific American as an example.
The website URL we will use is https://www.scientificamerican.com/podcast/60-second-science/?page=1
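As a side note, the page number is carried in the URL's "page" query parameter. The sketch below is only an illustration in Python, not part of the Octoparse task: the helper names page_url and page_number are hypothetical, and it assumes that increasing "page" by one leads to the next listing page.

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

START_URL = "https://www.scientificamerican.com/podcast/60-second-science/?page=1"

def page_url(url: str, page: int) -> str:
    """Return `url` with its `page` query parameter set to `page` (hypothetical helper)."""
    parts = urlparse(url)
    query = parse_qs(parts.query)
    query["page"] = [str(page)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))

def page_number(url: str) -> int:
    """Read the `page` query parameter back out of a URL (hypothetical helper)."""
    return int(parse_qs(urlparse(url).query)["page"][0])

# URLs for the first three listing pages, assuming pages are numbered 1, 2, 3, ...
urls = [page_url(START_URL, n) for n in range(1, 4)]
```

Comparing page_number of each visited URL against the expected sequence is one simple way to notice that a task is stuck on a page or skipping pages.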
The data fields include the article title and content.
You can directly download the task (the .otd file) to begin collecting the data, or you can follow the steps below to build a task that scrapes the latest articles from Scientific American. (Download my extraction task of this tutorial HERE just in case you need it.)
Part 1 - Make a scraping task in Octoparse
Step 1. Click "Quick Start" ➜ Choose "New Task (Advanced Mode)" ➜Complete basic information ➜ Click "Next".
Step 2. Enter the target URL in the built-in browser. ➜ Click "Go" to open the webpage.
(URL of the example: https://www.scientificamerican.com/podcast/60-second-science/?page=1)
Step 3. Right click the "Next" button at the bottom of the web page. ➜ Click the Advanced Options and select the option 'Loop click Next page'.
Here we should note whether the 'Cycle Pages' is nested in 'Loop Item'.
Step 4. Right click the first abstract area. ➜ Create a list of target areas with a similar layout: click "Create a list of items" (articles with similar layout). ➜ Click "Add current item to the list".
Note: If Octoparse doesn't automatically select the target area, you can click the "Expansion Area" button in the upper right corner to adjust it.
The first article is now added to the list. ➜ Click "Continue to edit the list".
Right click the second abstract area. ➜ Click "Add current item to the list" again (now we get all the abstracts with a similar layout). ➜ Click "Finish Creating List". ➜ Click "loop" to process the list and extract the abstract data.
Note that when we add the second article to the list, Octoparse automatically adds all of the remaining articles to the "Loop Item" box, as we can see in the item list.
Note: Right click the content, if necessary, to avoid triggering its hyperlink.
Step 5. Extract the content of the article.
Right click the title of the article. ➜ Select "Extract text". Other content can be extracted in the same way.
All the extracted content will be listed in Data Fields. ➜ Click the "Field Name" to modify it. Then click "Save".
Step 6. Change the XPath of the Single Element - Next in the Loop Item.
Go to "Loop mode" ➜ Select "Single element" ➜ Change the XPath of the single element to: //span[text()='Next'] ➜ Click "Save"
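To see why a text-based XPath is more robust, here is a minimal Python sketch run against made-up pager markup (the real page's HTML may differ, and the position-based path is only an assumed example of what an auto-generated XPath can look like). Python's built-in ElementTree supports only a subset of XPath and has no text() test, so the sketch uses .//span[.='Next'], which selects the same element here as //span[text()='Next'].

```python
import xml.etree.ElementTree as ET

# Hypothetical pager markup; the real page's HTML may differ.
PAGER = """
<div class="pager">
  <a href="?page=1"><span>Previous</span></a>
  <a href="?page=2"><span>2</span></a>
  <a href="?page=3"><span>Next</span></a>
</div>
""".strip()

root = ET.fromstring(PAGER)

# A position-based path (assumed auto-generated style) picks whatever link
# happens to sit in second place, here the page-number link "2".
by_position = root.findall("./a[2]/span")

# A text-based path keeps pointing at the span whose text is exactly 'Next',
# no matter how many links come before it.
by_text = root.findall(".//span[.='Next']")
```

In this sample, by_position selects the "2" link while by_text selects the "Next" label, which is the behavior the modified XPath is meant to guarantee.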
Step 7. Check the workflow.
Now we need to check the workflow by clicking through each action from the beginning, making sure that we can scrape the content from the pages.
Go to Web Page ➜ Cycle Pages ➜ The Loop Item box ➜ Extract Data ➜ Click to Paginate.
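The order of these actions can be pictured as two nested loops. The following Python sketch only mirrors that structure on made-up in-memory data; it is not how Octoparse is implemented, and the names PAGES, BROKEN and run_workflow are hypothetical. It also shows how a 'Next' link that points back to the same page can be detected.

```python
# Hypothetical in-memory "site": page number -> (article titles, next page or None).
PAGES = {
    1: (["Episode A", "Episode B"], 2),
    2: (["Episode C", "Episode D"], 3),
    3: (["Episode E"], None),
}

# Same data, but the 'Next' on page 2 wrongly leads back to page 2.
BROKEN = {**PAGES, 2: (["Episode C", "Episode D"], 2)}

def run_workflow(pages, start=1, max_pages=10):
    """Mimic Go to Web Page -> Cycle Pages -> Loop Item -> Extract Data -> Click to Paginate."""
    rows, seen = [], set()
    current = start                                   # Go to Web Page
    for _ in range(max_pages):                        # Cycle Pages
        if current in seen:                           # same page again: pagination issue
            raise RuntimeError(f"page {current} was scraped twice")
        seen.add(current)
        titles, next_page = pages[current]
        for title in titles:                          # Loop Item
            rows.append({"page": current, "title": title})  # Extract Data
        if next_page is None:                         # no 'Next' left: stop
            break
        current = next_page                           # Click to Paginate
    return rows

rows = run_workflow(PAGES)
```

Calling run_workflow(BROKEN) raises a RuntimeError on the second visit to page 2, which corresponds to the "scraping the same page repeatedly" symptom described at the start of this tutorial.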
Step 8. Click "Save" to save your configuration. Then click "Next" ➜ Click "Next" ➜ Click "Local Extraction" to run the task on your computer. Octoparse will automatically extract all the data selected.
Step 9. The extracted data will be shown in the "Data Extracted" pane. Click the "Export" button to export the results to an Excel file, a database or another format, and save the file to your computer.
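If you want to post-process the exported data outside Octoparse, the two data fields map naturally onto a two-column table. The sketch below is a hedged illustration using Python's standard csv module with made-up sample rows; the field names "title" and "content" and the helper to_csv are assumptions, not the names Octoparse writes.

```python
import csv
import io

# Hypothetical extracted rows using the tutorial's two data fields.
rows = [
    {"title": "Sample episode one", "content": "First abstract, with a comma."},
    {"title": "Sample episode two", "content": "Second abstract."},
]

def to_csv(records, fieldnames=("title", "content")):
    """Serialize records to CSV text with a header row (hypothetical helper)."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

csv_text = to_csv(rows)
```

The csv module quotes any field that contains a comma, so article content survives a round trip through csv.DictReader unchanged.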
Part 2 - Schedule a task and run it on Octoparse's cloud platform
After the task has been perfectly configured following the steps above, you can schedule the task to run on Octoparse's cloud platform.
Step 1. Find the task you've just made in "My Task" ➜ Right click the task ➜ Right click "Schedule Cloud Extraction" ➜ Select the option "Schedule Cloud Extraction Settings" to begin the scheduling process.
Step 2. Set the parameters.
In the "Schedule Cloud Extraction Settings" dialog box, you can select the Periods of Availability for the extraction of your task and the Run mode, which controls how often the periodic task runs to collect data.
· Periods of Availability - The data extraction period, set with the Start date and End date.
· Run Mode - Once, Weekly, Monthly, Real Time
You can designate any time interval for data collection and click "Start" to schedule your task. After you click "OK" in the Cloud Extraction Schedule window, the task will be added to the waiting queue and you can check the status of the task.
Author: The Octoparse Team