Scrape Data from Multiple Web Pages (Example: Medline)
Wednesday, January 11, 2017 3:08 AM
Octoparse enables you to scrape data that a website spreads across multiple pages. Websites implement several pagination structures; one of the most common is numbered (numeric) pagination, also called page numbering.
In this web scraping tutorial we will scrape general anesthesia data from www.medline.com, which displays its products across multiple web pages (numbered pagination).
The website URL we will use is https://www.medline.com/category/General-Anesthesia/Z05-CA14_01_15.
You can directly download the task (the OTD file) to begin collecting the data. Or you can follow the steps below to build a scraping task for the Medline product pages. (Download my extraction task of this tutorial HERE just in case you need it.)
Step 1. Set up basic information.
Click "Quick Start" ➜ Choose "New Task (Advanced Mode)" ➜ Complete basic information ➜ Click "Next".
Step 2. Enter the target URL in the built-in browser. ➜ Click "Go" icon to open the webpage.
(URL of the example: https://www.medline.com/category/General-Anesthesia/Z05-CA14_01_15 )
Step 3. Move your cursor over the section with similar layout.
Click the first item ➜ Click "Create a list of items" (sections with similar layout) ➜ Click "Add current item to the list".
The first item has now been added to the list. ➜ Click "Continue to edit the list".
Click the second item ➜ Click "Add current item to the list" again. At this point the list contains only 4 items from the page. ➜ Click "Continue to edit the list" ➜ Click the last item ➜ Click "Add current item to the list" again. Now the list contains all the items from the page.
Then click "Finish Creating List" ➜ Click "loop" to process the list and extract the elements on each page.
Step 4. Scrape data from multiple web pages
Click the numbered pagination link ➜ Click Page 2 ➜ Choose "Click an item".
Drag a "Loop Item" into the workflow between the two "Click Item" actions ➜ Drag the second "Click Item" into the "Loop Item" box ➜ Click the "Loop Item" box.
You can use the "following-sibling::" axis to write an XPath that selects the page link immediately after the current page, and thus scrape multiple web pages with numbered pagination.
Choose a "Loop Mode" under "Advanced Options". ➜ Select "Single Element" option ➜ Enter the correct XPath into the text box ➜ Click "Save".
The correct XPath is:
.//div[@bottompagination=""]/ul/li/span[@class='selectedPage']/../following-sibling::li[1]/a
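To see what this XPath selects, the logic can be sketched outside Octoparse. The Python sketch below is only an illustration under stated assumptions: the pagination markup in SAMPLE is a simplified guess (the real Medline page differs), and next_page_href is a hypothetical helper. Because the standard library's ElementTree does not support the following-sibling axis, the sibling step is emulated by walking the list of <li> elements rather than by evaluating the XPath itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified pagination markup; the real page is more complex.
SAMPLE = """
<div bottompagination="">
  <ul>
    <li><a href="?page=1">1</a></li>
    <li><span class="selectedPage">2</span></li>
    <li><a href="?page=3">3</a></li>
    <li><a href="?page=4">4</a></li>
  </ul>
</div>
""".strip()


def next_page_href(root):
    """Return the href of the link right after the selected page, or None."""
    for ul in root.iter("ul"):
        items = list(ul)  # direct children of <ul>, in document order
        for index, li in enumerate(items):
            # li/span[@class='selectedPage'] marks the current page
            if li.find("span[@class='selectedPage']") is None:
                continue
            # following-sibling::li[1] is the first <li> after the current one
            following = [sib for sib in items[index + 1:] if sib.tag == "li"]
            if not following:
                return None  # last page: there is no next link to click
            link = following[0].find("a")  # the trailing /a step
            return link.get("href") if link is not None else None
    return None


print(next_page_href(ET.fromstring(SAMPLE)))
```

When the selected page is the last one, the helper returns None, which corresponds to the pagination loop having nothing left to click.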
Step 5. Move your cursor over the section with similar layout.
Click "Click Item" and wait until the web page about Anesthesia Circuits has completely loaded.
Click the first item ➜ Click "Create a list of items" (sections with similar layout) ➜ Click "Add current item to the list".
The first item has now been added to the list. ➜ Click "Continue to edit the list".
Click the final item ➜ Click "Add current item to the list" again. Now the list contains all the items from the page ➜ Then click "Finish Creating List" ➜ Click "loop" to process the list and extract the elements on each page.
Step 6. Drag the third "Loop Item" box before the "Click Item" action of the second "Loop Item" box so that we can grab all the information from multiple web pages.
Step 7. Go through the workflow.
Select an item that shows the full information, because some items do not display all the content we need; this lets us extract every detail we want. Then check the workflow.
Step 8. Move your cursor over the section with similar layout to extract the ordering information of the material.
Click the first item ➜ Click "Create a list of items" (sections with similar layout) ➜ Click "Add current item to the list".
The first item has now been added to the list. ➜ Click "Continue to edit the list".
Click the second item ➜ Click "Add current item to the list" again. Now the list contains all the items from the page ➜ Then click "Finish Creating List" ➜ Click "loop" to process the list and extract the detail information for these materials.
Step 9. Extract the detail information.
Extract the product type. ➜ Click the product type ➜ Select “Extract text”. Other contents can be extracted in the same way.
After all the content has been selected in Data Fields ➜ Click the “Field Name” to modify it. Then click “Save”.
Step 10. Check the workflow.
Now we need to check the workflow by clicking actions from the beginning of the workflow. Make sure that we can scrape the AJAX content from the pages.
Go to Web Page ➜ The first Loop Item box ➜ Click Item ➜ The second Loop Item box ➜ The third Loop Item box ➜ Click Item ➜ The fourth Loop Item box ➜ Click Item ➜ Extract Data ➜ Click Item.
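The nesting of the four Loop Item boxes can be easier to follow when written as ordinary nested loops. The Python sketch below is one reading of the workflow order above, not Octoparse's internal logic: SITE is a hypothetical in-memory stand-in for the website, and the product names and row values are invented for illustration only.

```python
# Hypothetical stand-in for the site: category -> pages -> products -> ordering rows.
SITE = {
    "Anesthesia Circuits": [
        [{"name": "Product A", "rows": ["Case", "Each"]}],  # page 1
        [{"name": "Product B", "rows": ["Box"]}],           # page 2
    ],
}


def run_workflow(site):
    """Walk the nested loops in workflow order and collect one record per row."""
    records = []
    # First Loop Item -> Click Item: open each category.
    for category, pages in site.items():
        # Second Loop Item: one pass per numbered page.
        for page_number, products in enumerate(pages, start=1):
            # Third Loop Item -> Click Item: open each product on the page.
            for product in products:
                # Fourth Loop Item -> Click Item: visit each ordering row.
                for row in product["rows"]:
                    # Extract Data: store the fields for this row.
                    records.append({
                        "category": category,
                        "page": page_number,
                        "product": product["name"],
                        "product_type": row,
                    })
            # The final Click Item (the next-page link) would run here.
    return records


for record in run_workflow(SITE):
    print(record)
```

The point of the sketch is the ordering: the product loop sits inside the pagination loop and runs before the next-page click, which is why Step 6 drags the third Loop Item in front of that Click Item.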
Step 11. Click "Save" to save your configuration. Then click "Next" ➜ Click "Next" ➜ Click "Local Extraction" to run the task on your computer. Octoparse will automatically extract all the data selected.
Step 12. The extracted data will be shown in the "Data Extracted" pane. Click the "Export" button to export the results to an Excel file, a database or another format, and save the file to your computer.
Note:
1. Some data fields in the task are added afterwards and not shown in the GIF file.
2. If some data fields have missing values in the output, you can work out why Octoparse could not extract them. See this article for the reasons behind missing values when using Local Extraction.
The original XPath for some data fields may fail to select the elements correctly, which results in missing values for those fields. In this case, you can modify the XPath expressions for these data fields. Here I replace all the SPAN tags with the * wildcard for all the data fields. Click "Save" to save the configuration. You can follow this tutorial to modify XPath expressions in Octoparse.
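The SPAN-to-* replacement described above can be expressed as a small text transformation. The Python sketch below is an illustration only: loosen_span_steps is a hypothetical helper, the example XPath is invented, and the regular expression assumes that "span" never appears as a path step inside a quoted attribute value.

```python
import re


def loosen_span_steps(xpath: str) -> str:
    """Replace every `span` element step in an XPath with the `*` wildcard."""
    # Match `span` only as a whole step: right after "/" or "::",
    # and right before "[", "/", or the end of the expression.
    return re.sub(r"(?<=[/:])span(?=\[|/|$)", "*", xpath, flags=re.IGNORECASE)


# Invented example expression, not taken from the Medline task.
original = "//div[@class='price']/SPAN[2]/span"
print(loosen_span_steps(original))
```

A wildcard step matches any element at that position, so the loosened expression is more tolerant of markup differences between pages but can also select elements you did not intend. It is worth rechecking the extracted values after the change.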
Knowing how to edit XPath expressions can help you solve many problems when scraping data from websites. The tutorials and FAQs below can help you pick up XPath quickly.
How to use Firebug and Firepath?
Modify XPath Manually in Octoparse
Author: The Octoparse Team