Scrape Data from Multiple Web Pages (Example: Medline)

Wednesday, January 11, 2017 3:08 AM

 

Octoparse lets you scrape data that a website spreads across multiple pages. Websites implement pagination in several ways; one of the most common is numbered (numeric) pagination, also called page numbering.

 

In this web scraping tutorial, we will scrape general anesthesia product data from www.medline.com, which lists its products across multiple web pages using numbered pagination.

 

The website URL we will use is https://www.medline.com/category/General-Anesthesia/Z05-CA14_01_15.

 

You can download the task (the .otd file) directly to begin collecting the data, or you can follow the steps below to build a scraping task for the Medline general anesthesia products. (Download my extraction task of this tutorial HERE just in case you need it.)

 

Step 1. Set up basic information.

 

 

Click "Quick Start" ➜ Choose "New Task (Advanced Mode)" ➜ Complete basic information ➜ Click "Next".

 

 

Step 2. Enter the target URL in the built-in browser ➜ Click the "Go" icon to open the webpage.

 

(URL of the example: https://www.medline.com/category/General-Anesthesia/Z05-CA14_01_15 )

 

Step 3. Move your cursor over the section with similar layout.

 

Click the first item ➜ Click "Create a list of items" (sections with similar layout) ➜ Click "Add current item to the list".

The first item has now been added to the list. ➜ Click "Continue to edit the list".

Click the second item ➜ Click "Add current item to the list" again. At this point only 4 items from the page are in the list. ➜ Click "Continue to edit the list" ➜ Click the last item ➜ Click "Add current item to the list" again. Now all the items from the page are in the list.

Then click "Finish Creating List" ➜ Click "loop" to process the list and extract the elements on each page.

 

Step 4. Scrape data from multiple web pages

 

In the numbered pagination, click the link for Page 2 ➜ Choose "Click an item".

Drag a "Loop Item" into the workflow between the two "Click Item" actions ➜ Drag the second "Click Item" into the "Loop Item" box ➜ Click the "Loop Item" box.

 

You can use the "following-sibling::" axis to write an XPath that selects the page link immediately after the current page, and so scrape multiple web pages with numbered pagination.

 

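Choose a "Loop Mode" under "Advanced Options" ➜ Select the "Single Element" option ➜ Enter the correct XPath into the text box ➜ Click "Save".

The correct XPath is: .//div[@bottompagination=""]/ul/li/span[@class='selectedPage']/../following-sibling::li[1]/a

It finds the pagination item that holds the selected-page marker, steps up to its parent li, and then selects the link inside the first li that follows it.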
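To make the logic of this XPath easier to follow, here is a minimal Python sketch that performs the same selection by hand. It is only an illustration: the SAMPLE markup is a hypothetical, simplified pagination block (not copied from the live site), next_page_href is a made-up helper name, and the standard-library ElementTree module is used because it does not support the following-sibling axis, so the sibling step is written out explicitly.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified pagination markup (an assumption, not the real page).
SAMPLE = """
<div bottompagination="">
  <ul>
    <li><a href="?page=1">1</a></li>
    <li><span class="selectedPage">2</span></li>
    <li><a href="?page=3">3</a></li>
    <li><a href="?page=4">4</a></li>
  </ul>
</div>
"""


def next_page_href(markup):
    """Return the href of the link right after the selected page, or None."""
    root = ET.fromstring(markup)
    for div in root.iter("div"):
        # .//div[@bottompagination=""]
        if div.get("bottompagination") != "":
            continue
        for ul in div.findall("ul"):
            items = ul.findall("li")
            for index, li in enumerate(items):
                # li/span[@class='selectedPage']/..  (the li holding the marker)
                if li.find("span[@class='selectedPage']") is None:
                    continue
                # following-sibling::li[1]  (the very next li, if any)
                if index + 1 == len(items):
                    return None  # last page: nothing left to click
                # .../a
                link = items[index + 1].find("a")
                return None if link is None else link.get("href")
    return None


print(next_page_href(SAMPLE))
```

Because the selection is always relative to the currently selected page, the same expression keeps working after every click, and it matches nothing on the last page, which is what lets the loop stop.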

 

Step 5. Move your cursor over the section with similar layout.

 

Click "Click Item" and wait until the Anesthesia Circuits web page has completely loaded.

 

Click the first item ➜ Click "Create a list of items" (sections with similar layout) ➜ Click "Add current item to the list".

The first item has now been added to the list. ➜ Click "Continue to edit the list".

Click the final item ➜ Click "Add current item to the list" again. Now all the items from the page are in the list. ➜ Then click "Finish Creating List" ➜ Click "loop" to process the list and extract the elements on each page.


 

Step 6. Drag the third "Loop Item" box before the "Click Item" action of the second "Loop Item" box so that we can grab all the information from multiple web pages.
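The reason the order matters can be sketched in a few lines of Python. This is not how Octoparse is implemented; it is a hedged illustration in which PAGES is a hypothetical in-memory stand-in for the paginated site and scrape_all is a made-up helper. The point it shows is that the item loop has to run on the current page before the next-page click happens.

```python
# Hypothetical in-memory "site": each page lists items and names the next page.
PAGES = {
    "page-1": {"items": ["circuit A", "circuit B"], "next": "page-2"},
    "page-2": {"items": ["circuit C"], "next": "page-3"},
    "page-3": {"items": ["circuit D", "circuit E"], "next": None},
}


def scrape_all(pages, start):
    """Visit every page once, collecting the items before moving on."""
    rows = []
    seen = set()
    current = start
    while current is not None and current not in seen:
        seen.add(current)  # guard against a pagination link that loops back
        page = pages[current]
        # Inner loop first: extract every item on the current page...
        for item in page["items"]:
            rows.append((current, item))
        # ...and only then follow the next-page link (the "Click Item" action).
        current = page["next"]
    return rows


for row in scrape_all(PAGES, "page-1"):
    print(row)
```

If the click came first, the items on the first page would never be visited, which is the situation the drag in this step avoids.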

 

 

Step 7. Go through the workflow.

 

Some items do not display all of the content we need, so select an item that shows the full information; this lets us extract the detailed information we want. Then check the workflow.

 

Step 8. Move your cursor over the section with similar layout to extract the ordering information of the material.

 

Click the first item ➜ Click "Create a list of items" (sections with similar layout) ➜ Click "Add current item to the list".

The first item has now been added to the list. ➜ Click "Continue to edit the list".

Click the second item ➜ Click "Add current item to the list" again. Now all the items from the page are in the list. ➜ Then click "Finish Creating List" ➜ Click "loop" to process the list and extract the detailed information for these materials.

 

Step 9. Extract the detail information.

 

To extract the product type, click the product type ➜ Select “Extract text”. Other content can be extracted in the same way.

After all the content has been selected in Data Fields ➜ Click a “Field Name” to rename it. Then click “Save”.
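As a rough picture of what "Extract text" plus renamed field names produce, the following Python sketch turns one row of markup into a named record. Everything in it is an assumption made for illustration: the ROW markup, its values, the field names, and the helpers extract_text and extract_row are hypothetical and do not come from the Medline page or from Octoparse.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for one ordering row (placeholder values, not real data).
ROW = """
<tr>
  <td>ITEM-001</td>
  <td> Adult <b>Circuit</b> </td>
  <td>20/CS</td>
</tr>
"""


def extract_text(element):
    """Join all nested text of an element and collapse the whitespace."""
    return " ".join("".join(element.itertext()).split())


def extract_row(markup, field_names):
    """Pair each cell's text with a descriptive field name."""
    row = ET.fromstring(markup)
    values = [extract_text(cell) for cell in row.findall("td")]
    return dict(zip(field_names, values))


record = extract_row(ROW, ["item_number", "product_type", "packaging"])
print(record)
```

Descriptive field names matter mostly at export time, since they become the column headers of the output.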

 

Step 10. Check the workflow.

 

Now we need to check the workflow by clicking actions from the beginning of the workflow. Make sure that we can scrape the AJAX content from the pages.

Go to Web Page ➜ The first Loop Item box ➜ Click Item ➜ The second Loop Item box ➜ The third Loop Item box ➜ Click Item ➜  The fourth Loop Item box ➜ Click Item ➜ Extract Data ➜ Click Item.

 

Step 11. Click "Save" to save your configuration. Then click "Next" ➜ Click "Next" ➜ Click "Local Extraction" to run the task on your computer. Octoparse will automatically extract all the data selected.

 

 

 

Step 12. The extracted data will be shown in the "Data Extracted" pane. Click the "Export" button to export the results to an Excel file, a database, or another format, and save the file to your computer.

 

Note:

1. Some data fields in the task were added afterwards and are not shown in the GIF file.

2. If some data fields have missing values in the output, you can investigate why Octoparse could not extract them. Click this article to find out the reasons for missing values when using Local Extraction.

The original XPath for some data fields may fail to select the elements correctly, which results in missing values for those fields. In this case, you can modify the XPath expressions for these data fields. Here I replace all the SPAN tags with * (which matches any tag) for all the data fields. Click "Save" to save the configuration. You can follow this tutorial to modify XPath expressions in Octoparse.

Knowing how to edit XPath expressions can help you solve many problems when scraping data from websites. The tutorials and FAQs below can help you pick up XPath quickly.
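The effect of swapping SPAN for * can be sketched in Python. This is a hedged illustration only: loosen_span_steps is a hypothetical helper, the ITEM markup is invented, and the regular expression is a simple text rewrite that assumes "span" appears only as a step name after a slash.

```python
import re
import xml.etree.ElementTree as ET


def loosen_span_steps(xpath):
    """Replace every span/SPAN step in an XPath with the * wildcard."""
    # Match "span" only as a whole step name that directly follows a slash.
    return re.sub(r"(?<=/)span(?![\w.-])", "*", xpath, flags=re.IGNORECASE)


# Hypothetical markup: the value sits in a <strong>, not the expected <span>.
ITEM = ET.fromstring('<li><strong class="price">12.50</strong></li>')

strict = "./span[@class='price']"
loose = loosen_span_steps(strict)

print(loose)                    # the rewritten expression
print(ITEM.find(strict))        # the strict version finds no element here
print(ITEM.find(loose).text)    # the wildcard version still finds the value
```

The trade-off is that a wildcard step is less specific, so it helps to keep an attribute predicate such as the class test to avoid matching unrelated elements.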

How to use Firebug and Firepath?

Getting started with XPath 1

Getting Started With XPath 2

Modify XPath Manually in Octoparse

 

 

 

 

 

Author: The Octoparse Team

 

 

 

 

 
