A newer version of this tutorial is available here. Check it out!
In this tutorial, we show how to scrape information from Tripadvisor.com.
To follow along, you may want to use the URL in this tutorial:
We will enter each hotel detail page and scrape the hotel title, location, price, and rating.
This tutorial will also cover:
· Reformatting the star rating into a number with the RegEx tool in Octoparse
Main steps in the tutorial: [Download demo task file here]
1) "Go To Web Page" - to open the targeted web page
• Create the task with "Advanced Mode".
• Paste the URL into the "Extraction URL" box and click "Save URL" to move on.
Because of how Tripadvisor uses cookies, we need to set the search filters inside Octoparse.
• Select the "Check-in" date in the built-in browser and click "Click Element" in "Action Tips".
• Repeat these actions to set the "Check-out" date and "Guest Information".
The result page we need is now loaded.
2) Create a pagination loop - to scrape all the results from multiple pages
• Scroll down and click the "Next Page" button on the webpage
• Click "Loop click next page" on "Action Tips"
Because Tripadvisor loads its content with AJAX, we need to set up AJAX loading for the "Pagination" action.
• Uncheck "Auto retry when no response"
• Check "Load the page with AJAX"
• Set up "AJAX Timeout"
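To see why a timeout matters here: with AJAX, clicking "Next Page" updates the results in place without a full page reload, so there is no page-load event to wait for. The sketch below is only a conceptual model of waiting for new content with an upper time limit. It is not how Octoparse is implemented, and the names `wait_for`, `timeout_seconds`, and `poll_interval` are hypothetical.

```python
import time


def wait_for(condition, timeout_seconds, poll_interval=0.05):
    """Poll condition() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll_interval)
    # One last check in case the content arrived right at the deadline.
    return condition()


# Simulated page: the "new results" only appear on the third check.
checks = {"count": 0}


def new_results_loaded():
    checks["count"] += 1
    return checks["count"] >= 3


loaded = wait_for(new_results_loaded, timeout_seconds=1.0, poll_interval=0.01)
```

If the timeout is too short, the next action may run before the new results have arrived; if it is too long, every page costs extra waiting time.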
3) Create a "Loop Item" - to loop click into each item on each list
We are now on the second page. When creating a "Loop Item", we should always start with the first item on the first page, so we need to go back to the first page.
• Click "Go To Web Page" in the workflow.
• Delete the three "Click item" actions
Octoparse sends the saved cookie to the website when the page loads, so we can open the result page directly. Since Tripadvisor has already "remembered" us, there is no need to keep these actions.
• Select the pagination loop in the workflow
By doing this, we can help Octoparse decide the execution order and generate the Loop Item at the appropriate position in the workflow.
Now, let's build the Loop Item:
• Click the title of the first item on the listing page
• Click "Select All" on "Action Tips"
• Select "Loop click each element"
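The resulting workflow nests the "Loop Item" inside the pagination loop. As a rough illustration of that execution order, here is a minimal Python sketch. The data in `RESULT_PAGES` and the function `run_workflow` are hypothetical stand-ins and not part of Octoparse.

```python
# Hypothetical stand-in for the site: each inner list is one result page,
# and each string is the title link of one hotel on that page.
RESULT_PAGES = [
    ["Hotel A", "Hotel B"],
    ["Hotel C"],
]


def run_workflow(pages):
    """Visit every item on every page, in the order the workflow would."""
    visited = []
    page_index = 0
    while page_index < len(pages):           # pagination loop
        for title in pages[page_index]:      # Loop Item: click each element
            # Here the real task would open the detail page and extract data.
            visited.append((page_index + 1, title))
        page_index += 1                      # click "Next Page"
    return visited


visit_order = run_workflow(RESULT_PAGES)
```

Every item on the current page is handled before the next page is opened, which is why the Loop Item has to sit inside the pagination loop rather than after it.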
4) Extract data - to select the data for extraction
• Click the information you need on the page
• Select "Extract data" in the "Action Tips"
• Rename the fields by selecting from the pre-defined list or inputting on your own
When you click on the rating of the listing, choose "Extract button outer HTML". The data extracted needs to be processed further with Regular Expression. See how it's done in Step 5.
5) Customize the data field using RegEx tool - to reformat rating data (Optional)
When the data we want is not shown as readable text on the web page, we need to extract its source code (HTML) first and then process the extracted source code into our desired format.
• Select the "Rating" field to be modified
• Click "Customize data field"
• Select "Refine extracted data", click "Add step", and then select "Match with Regular Expression"
• Select "Try RegEx Tool"
• Check the box for "Start With" and enter "alt="
• Check the box for "End With" and enter "star rating"
• Click "Generate" and "Match"
• Click "Apply" and "OK"
• Click "OK" to save
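The same "Start With" / "End With" idea can be tried outside Octoparse. The Python sketch below takes the text between the two markers used above (alt= and star rating) and converts it to a number. The sample HTML is hypothetical and real Tripadvisor markup may differ; the helper names `text_between` and `rating_as_number` are made up for illustration, and the pattern is not necessarily the one the RegEx Tool generates.

```python
import re


def text_between(source, start, end):
    """Return the shortest text between the start and end markers, or None."""
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    match = re.search(pattern, source, flags=re.DOTALL)
    return match.group(1) if match else None


def rating_as_number(outer_html):
    """Turn the matched text into a float, or None if nothing usable is found."""
    raw = text_between(outer_html, "alt=", "star rating")
    if raw is None:
        return None
    cleaned = raw.strip(" \"'")  # drop the surrounding quote and spaces
    try:
        return float(cleaned)
    except ValueError:
        return None


# Hypothetical outer HTML for a rating element (not copied from Tripadvisor).
sample_html = '<span class="rating"><img alt="4.5 star rating" src="rating.png"></span>'
print(rating_as_number(sample_html))
```

Returning None instead of raising an error keeps one oddly formatted listing from stopping the rest of the clean-up.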
6) Save and start extraction - to run the task and get data
• Click "Start Extraction"
• Select "Local Extraction" to run the task on your computer
Here is the sample output.