Web Crawling Case Study | Scraping ASTA with Pagination (2) - No "Next Button" Found
Wednesday, April 05, 2017 2:44 AM
In the tutorial Scraping from multi-pages: pagination without "Next" button, we have learnt how to flip pages without a "Next" button.
In this tutorial, I will use the ASTA website as an example to show you, step by step, how to scrape data from a paginated website that has no "Next" button.
Features covered:
- Building a list
- Modifying XPath
Now, let's get started!
Step 1. Set up basic information and navigate to the target website
- Before you scrape data with pagination, complete the basic information for the task
- Enter the target URL in the built-in browser (URL of the example: http://web.asta.org/iMIS/ASTA/Directory?navItemNumber=11304)
- Click the "Go" icon to open the webpage
Step 2. Find the pages to scrape
On this website, search results are not displayed until you start a search yourself by clicking an item.
- Click on the “Find” button
- Click “Click an item”
Step 3. Set up Pagination
- Drop a “Loop” item into Workflow designer.
- Choose a “Loop Mode” under “Advanced Options”.
- Select “Single Element” option.
Step 4. Modify XPath to locate next page
- Make sure the first page is open, so that the data from every page is captured.
- Then paste the following XPath into the “Single Element” text box: //div/table/tbody/tr[@class='cssPager']/td/table/tbody/tr/td/span/../following-sibling::td/a
- Click “Save”.
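To see what this XPath selects, it can help to mimic it on a small pager row. In a pager of this kind the current page is usually rendered as a span rather than a link, so the expression finds the cell that holds the span, steps up to it with "..", and then takes the links in the cells that follow; presumably "Single Element" mode uses the first of those, which is the next page. The Python sketch below is only an illustration: the sample row and the function name next_page_link are made up, and because the standard library's ElementTree does not support the following-sibling axis, that step is written as a plain loop.

```python
import xml.etree.ElementTree as ET

# Hypothetical pager row: page 2 is the current page, so it is a <span>, not a link.
PAGER_ROW = """
<tr>
  <td><a href="?page=1">1</a></td>
  <td><span>2</span></td>
  <td><a href="?page=3">3</a></td>
  <td><a href="?page=4">4</a></td>
</tr>
"""


def next_page_link(pager_row_xml):
    """Return the first <a> in a cell after the current-page cell, or None."""
    row = ET.fromstring(pager_row_xml)
    cells = list(row)
    for index, cell in enumerate(cells):
        # td/span/.. : the cell that contains the current-page <span>
        if cell.find("span") is not None:
            # following-sibling::td/a : links in the cells after that one
            for sibling in cells[index + 1:]:
                link = sibling.find("a")
                if link is not None:
                    return link
            return None  # current page is the last one, nothing left to click
    return None  # no current-page marker found


link = next_page_link(PAGER_ROW)
print(link.text if link is not None else "no next page")
```

When the current page is the last one, no following cell contains a link, so nothing is matched and the pagination loop has nothing left to click.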
Step 5. Click Items in the loop to scrape data with pagination
- Drop a “Click Item” into the “Loop item”
- Choose “Click Loop items” under “Advanced Option”
- Click “Save”.
Now you’ve configured pagination scraping.
Step 6. Create a list of items
Move your cursor over the sections with a similar layout, from which you want to extract content.
- Click anywhere on the first section of the web page
- If the selection was not identified properly at first, click “Expand the selection area” until the outlined box includes all the content you want to scrape.
- When prompted, Click “Create a list of items” (sections with similar layout)
- Click “Add current item to the list”
Now the first item has been added to the list. Next, we need to add the remaining items.
- Click “Continue to edit the list”
- Click a second section with similar layout
- Click “Add current item to the list” again
Now all the sections with a similar layout have been added to the list.
- Click “Finish Creating List”
- Click “loop”. This action tells Octoparse to click on each section in the list to extract the selected data
Step 7. Select the data to be extracted and Rename data fields.
- Click the data field “Full Name”.
- Select “Extract text”
- Follow the same steps to extract the other data.
- Rename any field names if necessary.
- Click "Save"
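Conceptually, Steps 6 and 7 amount to looping over sections with a similar layout and reading the text of each chosen field. The Python sketch below illustrates that idea under loud assumptions: the sample markup, the class names, the "City" field and the function name extract_records are all invented for the example, and only the "Full Name" field name comes from this tutorial. It does not reflect how Octoparse works internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment: two sections (rows) that share the same layout.
SAMPLE_PAGE = """
<table>
  <tr class="member"><td class="full-name"> Jane Doe </td><td class="city">Austin</td></tr>
  <tr class="member"><td class="full-name">John Roe</td><td class="city">Denver</td></tr>
</table>
"""

# Field name -> path relative to one section. "City" is an invented second field.
FIELDS = {
    "Full Name": "td[@class='full-name']",
    "City": "td[@class='city']",
}


def extract_records(page_xml):
    """Build one record per similar section, taking the text of each field."""
    root = ET.fromstring(page_xml)
    records = []
    # The "list of items": every section that shares the same layout.
    for section in root.findall(".//tr[@class='member']"):
        record = {}
        for field_name, path in FIELDS.items():
            cell = section.find(path)
            # "Extract text": join all text inside the element and trim whitespace.
            record[field_name] = "".join(cell.itertext()).strip() if cell is not None else ""
        records.append(record)
    return records


for record in extract_records(SAMPLE_PAGE):
    print(record)
```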
Step 8. Re-order workflow
Notice that the loop action for data extraction is positioned outside of the pagination loop. That order is wrong, because we want to extract the data from each page before turning to the next one. So we need to manually drag the data extraction loop inside the pagination loop and position it right before the “Click to paginate” action in the workflow designer.
Now look at the workflow we created: it extracts the data, turns the page, then loops back to extract and turn the page again, which is exactly what we want.
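The reordered workflow is a pair of nested loops: an outer pagination loop with the extraction loop inside it, placed before the page turn. As a rough sketch only, here is the same control flow in Python. The function name scrape_all_pages and the idea of representing the site as a list of pages are assumptions made for illustration, not part of Octoparse.

```python
def scrape_all_pages(pages):
    """Extract every item page by page.

    `pages` is a hypothetical stand-in for the website: a list of pages,
    each page being a list of items.
    """
    extracted = []
    page_number = 0
    # Outer loop: pagination.
    while page_number < len(pages):
        # Inner loop: extract every item on the current page first ...
        for item in pages[page_number]:
            extracted.append(item)
        # ... and only then "click to paginate" to reach the next page.
        page_number += 1
    return extracted


print(scrape_all_pages([["a", "b"], ["c"], ["d", "e"]]))
```

Keeping the extraction inside the outer loop is what guarantees that every page is read before the page turn replaces its content.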
Step 9. Start running your task
- After saving your extraction configuration, click “Next”
- Select “Local Extraction”
- Click “OK” to run the task on your computer.
Octoparse will automatically extract all the selected data. Check the "Data Extracted" pane for the extraction progress.
Step 10. Check the data and export
- Check the data extracted
- Click the "Export" button to export the results to an Excel file, a database, or another format, and save the file to your computer
Now you've learned how to flip through pages to scrape data without a "Next" button. Let’s look into how pagination works with this example.
Now check out similar case studies:
- Web Scraping Case Study | Security System News
- How to Scrape WordPress Posts
- Web Scraping - Scraping Facebook That Required Login with Octoparse
Or, learn more about pagination related topics:
- Pagination Loop issue: The extraction stops after 3 pages
- Create A Loop For Pagination Manually
- Pagination Scraping: Configure “Loop click next page” When It Can’t Be Detected
Author: The Octoparse Team