
Web Crawling Case Study | Crawling Data from Booking.com

Monday, May 15, 2017 8:29 AM


 

This tutorial shows how to scrape hotel information from Booking.com.

Alternatively, open "Task Templates" on the main screen of Octoparse and start with the ready-to-use Booking template to save time. A template requires no task configuration. For further details, see: Task Templates


If you would rather build the task from scratch, continue with the tutorial below.

 

You may need this URL to follow along with the tutorial:

 

We will scrape data such as hotel names, images, addresses, descriptions, scores, reviews, and star ratings with Octoparse.

 

Here are the main steps in this tutorial: [Download the demo task here]

1. Go to the web page - open the target web page

2. Auto-detect the web page - create a workflow

3. Click into each detail link to scrape more information

4. Extract Data - extract data on the detail pages

5. Set up wait time - slow down the scraping speed

6. Modify the XPath of Pagination

7. Start extraction - run the task and get data

 

 

1. Go to the web page - open the target web page

 


 

2. Auto-detect the web page - create a workflow

 

  • Click "Auto-detect web page" and uncheck "Add a page scroll" to create a workflow
  • Reorder the fields as needed
  • Delete or rename fields

In Octoparse 8.4, you can delete all the fields you don't want at once after auto-detection. Click the view icon to switch to vertical view, where you can delete and rename fields. To rename a field, double-click its name.

If all the data you need is available on the listing page, skip ahead to step 5 (Set up wait time). If you want to open each detail link to get more information, continue with the next step.

 

3. Click into each detail link to scrape more information

  • Choose “Click on link(s) to scrape the linked page(s)” on the Tips panel
  • Select "Click on an extracted data field", then pick the field to click on from the drop-down menu (you can confirm that it is the correct link in the Data Preview)
  • Click "Confirm"

 

4. Extract Data - extract data on the detail pages

  • Select the data you need and click "Extract the text of the element" 
  • Double click on the field name to rename it if needed
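Octoparse builds this workflow visually, but the logic of steps 3 and 4 can be pictured as nested loops: for each page of results, for each listed hotel, open the detail page and extract its fields. The following is only a conceptual sketch in Python, not Octoparse's internals. The page data, links, and field names are all hypothetical and held in memory so the example runs without touching any website.

```python
# Hypothetical in-memory stand-ins for the site: each inner list holds the
# detail links found on one page of search results.
LISTING_PAGES = [
    ["/hotel/a", "/hotel/b"],
    ["/hotel/c"],
]

# Hypothetical detail pages keyed by link.
DETAIL_PAGES = {
    "/hotel/a": {"name": "Hotel A", "address": "1 Example Street"},
    "/hotel/b": {"name": "Hotel B", "address": "2 Example Street"},
    "/hotel/c": {"name": "Hotel C", "address": "3 Example Street"},
}


def crawl(listing_pages, detail_pages):
    """Visit every detail link on every listing page and collect one row each."""
    rows = []
    for links in listing_pages:          # pagination loop
        for link in links:               # loop over the items on one page
            detail = detail_pages[link]  # "click" into the detail page
            rows.append({"link": link, **detail})  # extract the fields
    return rows


rows = crawl(LISTING_PAGES, DETAIL_PAGES)
for row in rows:
    print(row)
```

Each row corresponds to one hotel, which is why the extracted link field from the listing page has to point at the correct detail page.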

 


 

5. Set up wait time - slow down the scraping speed

Booking.com might block your IP address if you send requests too quickly, so set a wait time to slow down the scraping speed.
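The idea behind a wait time is simply to pause between page visits. As a rough illustration outside of Octoparse, the Python sketch below pauses for a base delay plus a small random amount after each visit. The function names, the 2-second base, and the 1-second jitter are assumptions chosen for the example, not values recommended by Octoparse or Booking.com.

```python
import random
import time


def polite_delay(base_seconds=2.0, jitter_seconds=1.0, sleep=time.sleep):
    """Pause for base_seconds plus a random extra of up to jitter_seconds."""
    delay = base_seconds + random.uniform(0, jitter_seconds)
    sleep(delay)
    return delay


def visit_all(urls, fetch, base_seconds=2.0, jitter_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing after every visit."""
    results = []
    for url in urls:
        results.append(fetch(url))
        polite_delay(base_seconds, jitter_seconds, sleep)
    return results
```

Passing `sleep` as a parameter keeps the sketch easy to try out: a caller can substitute a function that only records the delays instead of actually waiting.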

 


 

6. Modify the XPath of Pagination

The auto-generated XPath for Pagination does not always locate the next-page button, so we need to modify it.

  • Click on Pagination
  • Replace the XPath with //button[@aria-label="Next page"]
  • Click Apply to save
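To see what this XPath matches, the expression can be tried on a small sample. The Python sketch below uses the standard library's ElementTree, which supports only a subset of XPath, so the leading `//` is written as `.//` there. The markup is a made-up, well-formed snippet, not Booking.com's actual page source, and a real page would normally need a proper HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified pagination markup (not the real page source).
SAMPLE = """
<nav>
  <button aria-label="Previous page" disabled="disabled">&lt;</button>
  <button aria-label="Next page">&gt;</button>
</nav>
""".strip()

root = ET.fromstring(SAMPLE)

# Select every <button> descendant whose aria-label is exactly "Next page".
matches = root.findall('.//button[@aria-label="Next page"]')

for button in matches:
    print(button.get("aria-label"), button.text)
```

The expression keys on the `aria-label` attribute rather than on the button's position, so it keeps matching only the next-page button even when other buttons sit beside it.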

 


 

Tip!

 Check out this tutorial to learn more about XPath: What is XPath and how to use it in Octoparse

 

 

7. Start extraction - run the task and get data

 

 


 

Here is the sample output. 

[Image: sample output]

 


 

Happy Data Hunting!

Author: The Octoparse Team
