Scrape hotel data from Tripadvisor
Thursday, September 20, 2018
According to The World Tourism Economic Trend Report (2018), the total number of tourists worldwide will reach approximately 12.67 billion in 2018. Demand for hotels will no doubt be huge in the coming years. To stay competitive, both established hotels and new entrants need to collect large amounts of raw data and analyze it properly.
In this tutorial, we will show you how to scrape hotel raw data from tripadvisor.com with our web scraping tool.
To follow along, you might want to use the URL in this tutorial:
Main steps in the tutorial: [Download demo task file here]
1) "Go To Web Page" - to open the targeted web page
· Create the task with "Advanced Mode".
· Paste the URL into the "Extraction URL" box and click "Save URL" to move on
2) Create a pagination loop - to scrape all the results from multiple pages
· Click the date and then the "click element" button on the "Action Tips" panel
· Move the mouse to the workflow panel and right-click to delete the steps
· Scroll down and click "Next Page"
· Click "Loop click next page" on the operation panel, and the pagination step appears in the workflow.
3) Create a "Loop Item" - to click into each item on each results page
· Click the product item and choose "Select all" on the "Action Tips" panel
· Click "Loop click selected page" and "Loop item" will be auto-created on the workflow
4) Extract data - to select the data for extraction
· Click the information you need on the page
· Select "Extract data" on the "Action Tips" panel.
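Steps 2 to 4 build a nested structure: an outer pagination loop, an inner loop over the items on each page, and an extraction action inside it. The following is only a rough Python sketch of that logic, not Octoparse's actual implementation. The `PAGES` data, the field names and the `run_workflow` function are hypothetical stand-ins so the example can run without touching any website.

```python
# Hypothetical in-memory stand-in for paginated search results.
# Each entry plays the role of one results page on the site.
PAGES = [
    {
        "hotels": [
            {"name": "Hotel A", "rating": "4.5"},
            {"name": "Hotel B", "rating": "4.0"},
        ],
        "has_next": True,
    },
    {
        "hotels": [{"name": "Hotel C", "rating": "3.5"}],
        "has_next": False,
    },
]


def run_workflow(pages):
    """Mimic the workflow: pagination loop > loop item > extract data."""
    rows = []
    page_index = 0
    while True:  # step 2: pagination loop
        page = pages[page_index]
        for hotel in page["hotels"]:  # step 3: loop over each item
            # step 4: extract the selected fields from the item
            rows.append({"name": hotel["name"], "rating": hotel["rating"]})
        if not page["has_next"]:  # no "Next Page" button left, so stop
            break
        page_index += 1  # equivalent of clicking "Next Page"
    return rows


rows = run_workflow(PAGES)
for row in rows:
    print(row["name"], row["rating"])
```

The point of the sketch is the nesting order: extraction sits inside the item loop, and the item loop sits inside the pagination loop, which is the same shape the workflow panel shows after steps 2 to 4.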
5) Customize the data field using RegEx tool - to reformat rating data (Optional)
In some cases, the data you need is embedded in the HTML together with extra strings that you don't need. For example, we need the star rating, but it cannot be extracted simply by clicking on it. In this case, we extract the HTML first and then reformat the extracted data to trim the unwanted strings. Here are the main steps for processing the extracted data in this example.
1. Click the "Rating" bar and then click the "Customize data field" button.
2. Select "define data extracted" and then "Extract outer HTML, including source code, text with format and image".
3. Click "OK" and we can see the data we extract is the source code of "Rating" information.
4. Click the "Customize data field" button again to reformat the data.
5. Select "Refine extracted data" and click "Add step".
6. Select "Match with Regular Expression" and click "Try RegEx Tool"
7. Start with "alt="" and end with "of 5", click "generate", then "Match", and the result will be auto-generated in the matches field.
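For readers who want to see what such a match does, here is a minimal Python sketch of the same idea: capture whatever sits between `alt="` and `of 5` in the outer HTML. The sample HTML string and the `extract_rating` function are assumptions made for illustration. The real markup on the page may differ, and the expression the RegEx Tool generates may not be identical to this one.

```python
import re

# Hypothetical outer HTML for the "Rating" element; the real markup may differ.
sample_html = '<span class="ui_bubble_rating bubble_45" alt="4.5 of 5 bubbles"></span>'


def extract_rating(outer_html):
    """Return the text between 'alt="' and 'of 5', or None if it is absent."""
    # (.*?) lazily captures the shortest text after alt=" that is
    # followed by optional whitespace and the literal "of 5".
    match = re.search(r'alt="(.*?)\s*of 5', outer_html)
    return match.group(1) if match else None


print(extract_rating(sample_html))
```

The lazy quantifier keeps the capture as short as possible, so only the rating value is returned rather than everything up to the last "of 5" in the string.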
6) Save and start extraction - to run the task and get data
· Click "Start Extraction" and select "Local Extraction" to extract.
· When the extraction process completes, the data can be exported.
Here is the result you might want to have.
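Once exported, the data can be loaded for analysis. The sketch below assumes the export is a CSV file with `name` and `rating` columns. Those column names, the sample rows and the `load_ratings` function are hypothetical, so adjust them to match the fields you actually defined in the task.

```python
import csv
import io

# Hypothetical export content; a real export would be read from a file.
EXPORTED_CSV = """name,rating
Hotel A,4.5
Hotel B,4.0
Hotel C,
"""


def load_ratings(csv_text):
    """Map each hotel name to a numeric rating (None when it is missing)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    ratings = {}
    for row in reader:
        rating = row["rating"].strip()
        # Extracted values are plain text, so convert them before analysis.
        ratings[row["name"]] = float(rating) if rating else None
    return ratings


print(load_ratings(EXPORTED_CSV))
```

Converting the rating to a number at load time makes it possible to sort or average hotels later, while rows with no rating are kept as `None` rather than dropped.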