Scrape hotel data from Tripadvisor

Monday, October 29, 2018

 

In this tutorial, we show how to scrape hotel information from Tripadvisor.com.

To follow along, you can use the example URL from this tutorial:

https://www.tripadvisor.com/Hotels-g186338-London_England-Hotels.html

We will open each hotel's detail page and scrape the hotel title, location, price, and rating.

 

This tutorial also covers:
   · Reformatting the star rating into a number with the RegEx tool in Octoparse

 

Main steps in the tutorial: [Download demo task file here]

1) "Go To Web Page" - to open the targeted web page

2) Create a pagination loop - to scrape all the results from multiple pages

3) Create a "Loop Item" - to click into each item on each results page

4) Extract data - to select the data for extraction

5) Customize the data field using RegEx tool - to reformat rating data (Optional)

6) Save and start extraction - to run the task and get data

 

 

 

 

 

1) "Go To Web Page" - to open the targeted web page

• Create the task with "Advanced Mode".

• Paste the URL into the "Extraction URL" box and click "Save URL" to move on.

 

Because of how Tripadvisor uses cookies, we need to configure the search filters in Octoparse before scraping.

• Select the "Check-in" date in the built-in browser and click "Click Element" on the "Action Tips".

• Repeat these actions to configure the "Check-out" date and "Guest Information".

The result page we need is now displayed.

 

 

 

2) Create a pagination loop - to scrape all the results from multiple pages

• Scroll down and click the "Next Page" button on the webpage

• Click "Loop click next page" on "Action Tips"

As Tripadvisor loads the content with AJAX, we should set up AJAX loading for the "Pagination" action.

• Uncheck "Auto retry when no response"

• Check "Load the page with AJAX"

• Set up "AJAX Timeout"
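
To see why a timeout is needed: an AJAX click replaces the listing in place without a full page load, so there is no "page finished loading" event to wait for. The Python sketch below is only a conceptual illustration of waiting with a timeout, not a description of how Octoparse works internally. The names `wait_for_new_content` and `read_listing`, and the timing values, are hypothetical and chosen for this example.

```python
import time


def wait_for_new_content(read_listing, previous, timeout=5.0, poll=0.1):
    """Poll read_listing() until it differs from `previous` or `timeout` seconds pass.

    Returns the new content, or None if nothing changed in time.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        current = read_listing()
        if current != previous:
            return current
        time.sleep(poll)
    return None


# Simulated page: the listing changes on the third read, as if AJAX had finished.
reads = iter(["page 1", "page 1", "page 2"])
result = wait_for_new_content(lambda: next(reads), "page 1", timeout=1.0, poll=0.01)
print(result)
```

If the timeout is too short, the wait gives up before the new listing appears. If it is too long, every page turn costs extra time. In practice the value is usually tuned to how quickly the site responds.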

 

 

 

 

 

3) Create a "Loop Item" - to click into each item on each results page

We are now on the second page. A "Loop Item" should always be created starting from the first item on the first page, so we need to go back to the first page.

• Click "Go To Web Page" in the workflow.

• Delete the three "Click item" actions

Octoparse sends the saved cookies to the website when it loads the page, so we can open the result page directly. Since Tripadvisor has already "remembered" our settings, these actions are no longer needed.

• Select the pagination loop in the workflow

Selecting the loop first helps Octoparse decide the execution order, so the Loop Item is generated at the appropriate position in the workflow.

 

 

Now, let's build the Loop Item:

• Click the title of the first item on the listing page

• Click "Select All" on "Action Tips"

• Select "Loop click each element"
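
The resulting workflow nests the Loop Item inside the pagination loop. The Python sketch below illustrates that execution order with a hypothetical in-memory stand-in for the site. The `pages` data and the `run_workflow` function are invented for this example and are not part of Octoparse.

```python
# Hypothetical in-memory stand-in for the site: each inner list is one result page.
pages = [
    ["Hotel A", "Hotel B"],
    ["Hotel C"],
]


def run_workflow(pages):
    """Visit every item on every page, in the order the workflow would."""
    visited = []
    for page_number, items in enumerate(pages, start=1):  # pagination loop
        for title in items:                               # Loop Item
            # Stand-in for opening the detail page and extracting data.
            visited.append((page_number, title))
    return visited


order = run_workflow(pages)
print(order)
```

Every item on a page is handled before the workflow moves on to the next page, which is why the Loop Item has to sit inside the pagination loop.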

 

 

 

4) Extract data - to select the data for extraction

• Click the information you need on the page

• Select "Extract data" in the "Action Tips"

• Rename the fields by selecting a name from the pre-defined list or typing your own

 

Tips!

When you click the rating on the listing, choose "Extract button outer HTML". The extracted data then needs further processing with a regular expression. See how it's done in Step 5.

 

 

 

5) Customize the data field using RegEx tool - to reformat rating data (Optional)

When the data we want is not shown as readable text on the web page, we first extract its source code (HTML) and then process the extracted HTML into the desired format.

• Select the "Rating" field to be modified

• Click "Customize data field"

• Select "Refine extracted data", click "Add step", and then select "Match with Regular Expression"

• Select "Try RegEx Tool"

• Check the box for "Start With" and enter "alt="

• Check the box for "End With" and enter "star rating"

• Click "Generate" and "Match"

• Click "Apply" and "OK"

• Click "OK" to save
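
To see what the "Start With" and "End With" settings do, here is a minimal Python sketch of the same idea. It rests on assumptions: the sample `outer_html` is hypothetical, the real Tripadvisor markup may differ, and the pattern is one plausible way to express "everything between `alt=` and `star rating`", not necessarily the expression the RegEx Tool generates.

```python
import re

# Hypothetical outer HTML for a rating element; the real markup may differ.
outer_html = '<span class="ui_star_rating star_45" alt="4.5 star rating"></span>'

# Lazily match everything between the literal "alt=" and the literal "star rating".
pattern = re.compile(r"(?<=alt=)[\s\S]*?(?=star rating)")

match = pattern.search(outer_html)
raw = match.group(0) if match else ""

# The raw match still carries the opening quote and a trailing space, so strip them.
rating = raw.strip(" \"'")
print(rating)
```

With this sample input, the raw match includes the quote that follows `alt=`, so the extra stripping step is needed to leave only the number.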

 

 

 

6) Save and start extraction - to run the task and get data

• Click "Start Extraction"

• Select "Local Extraction" to run the task on your computer 

 

 

Here is the sample output.

 
