Scrape Airbnb Data - Cloud Based Scraping
Wednesday, September 28, 2016 7:40 AM
Airbnb is a popular website for finding vacation accommodation. In this tutorial, you will learn how to use Octoparse to get hotel info from Airbnb.
The easiest way is to use the pre-built Airbnb task templates. You don't need to configure a scraping task; just enter keywords or URLs and wait for the data. For further details, see: Task Templates
If you want to build the task from scratch, read on. Here is the Airbnb room source link that we will be using as an example.
Here are the main steps in this tutorial [Download task file here]
1. Go to Web Page - open the target website
2. Build a Loop Item - click each hotel link
3. Extract data - scrape information from the detail page
4. Modify the XPath of data fields
5. Create pagination - scrape data from multiple pages
6. Modify the XPath of Pagination and Loop Item
7. Run your task - get data you want
1) Go to Web Page - open the target website
- Enter the URL on the home page and click Start
2) Build a Loop Item - click each hotel link
- Select the first two blocks to detect all blocks
- Click on "Loop click each URL" to enter the detail page
A Loop Item will be created and Octoparse opens the first hotel page automatically.
3) Extract data from the detail page
- Select any info you want and click on Extract the text of the element
- Select Add custom field -> Page-level data -> Page URL if you would like to pull the URL of the current page
- Double click the data field to modify the name
4) Modify the XPath of data fields
The Airbnb page design is tricky, and the auto-generated XPaths usually do not work for all pages. No worries! We have prepared everything you need. You can just use the element XPaths provided below.
- Switch to Vertical View - Vertical View can help modify multiple data fields easily
- Double click on the XPath to modify it
- Input the new XPath to it
Here are the XPaths for the different fields of Airbnb pages:
Hotel Title: //h1
Number of reviews: //button[contains(@aria-label,'Rate')]
Review rating: //button[contains(@aria-label,'Rate')]/../preceding-sibling::span
Number of guests: //span[contains(text(),'guest')]
Number of bedrooms: //span[contains(text(),'bedroom')]
Number of bathrooms: //span[contains(text(),'bathroom')]
Number of beds: //span[contains(text(),'bed')][not(contains(text(),'room'))]
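The XPath for beds excludes 'room' because contains(text(),'bed') on its own would also match a span that says 'bedroom'. The sketch below mirrors the same substring predicates in plain Python to show the difference. The sample span texts and the helper names are made up for illustration; the actual wording on Airbnb pages may differ.

```python
# Hypothetical span texts; the real page wording may differ.
SAMPLE_SPANS = ["4 guests", "2 bedrooms", "3 beds", "1 bathroom"]


def contains(text, needle):
    """Mirror of XPath contains(): a plain substring test."""
    return needle in text


def match_fields(spans):
    """Apply the same substring predicates as the XPaths above."""
    return {
        "guests": [s for s in spans if contains(s, "guest")],
        "bedrooms": [s for s in spans if contains(s, "bedroom")],
        "bathrooms": [s for s in spans if contains(s, "bathroom")],
        # 'bed' alone also matches 'bedroom', so 'room' is excluded.
        "beds": [s for s in spans
                 if contains(s, "bed") and not contains(s, "room")],
        # What the beds predicate would pick up without the exclusion.
        "beds_naive": [s for s in spans if contains(s, "bed")],
    }


if __name__ == "__main__":
    for field, hits in match_fields(SAMPLE_SPANS).items():
        print(field, hits)
```

With these sample strings, the naive 'bed' test returns both the bedrooms span and the beds span, while the version with the exclusion returns only the beds span.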
5) Create pagination
- Click on Go to Web Page to open the listing page again
- Select the next page button (">") at the bottom of the main page
- Choose Loop click single element from the Tips
A Pagination will be created in the workflow.
- Drag the workflow to the right position
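As a rough mental model of the finished workflow, the Pagination loop wraps the Loop Item: every listing on the current page is visited and extracted before the next-page button is clicked. The Python sketch below simulates that order with an in-memory list of pages. It assumes this nesting and is not Octoparse code; PAGES, extract and run_workflow are hypothetical names, and nothing here touches the network.

```python
# Hypothetical in-memory stand-in for the listing pages; no network access.
PAGES = [
    {"listings": ["listing-a", "listing-b"], "has_next": True},
    {"listings": ["listing-c"], "has_next": False},
]


def extract(listing):
    """Stand-in for the Extract Data step on a detail page."""
    return {"title": listing}


def run_workflow(pages):
    """Visit every listing on every page, in page order."""
    rows = []
    page_index = 0
    while True:                                # Pagination loop
        page = pages[page_index]
        for listing in page["listings"]:       # Loop Item: click each link
            rows.append(extract(listing))      # Extract Data
        if not page["has_next"]:               # no next-page button left
            break
        page_index += 1                        # Click to Paginate
    return rows


if __name__ == "__main__":
    for row in run_workflow(PAGES):
        print(row)
```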
6) Modify the XPath of Pagination and Loop Item
The auto-generated XPath does not always work well. In this case, we need to modify the XPaths of both the Pagination and the Loop Item.
- Click on Pagination
- Enter the XPath: //*[@aria-label='Next']
- Click on Loop Item
- Change Loop Mode to Variable list
- Enter XPath: //a[contains(@aria-labelledby,'title')]
- Click Apply to save
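To see what these two XPaths select, here is a small sketch using Python's standard xml.etree.ElementTree on a simplified, made-up fragment of markup (the real Airbnb page is much larger and may be structured differently). ElementTree supports only a subset of XPath: the attribute-equality test works directly, but contains() is not available, so the substring check for the Loop Item is emulated in Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; not copied from the real page.
SAMPLE = """
<div>
  <a aria-labelledby="title_101" href="/rooms/101">Room 101</a>
  <a aria-labelledby="title_102" href="/rooms/102">Room 102</a>
  <a href="/help">Help</a>
  <nav>
    <a aria-label="Previous" href="?page=1">prev</a>
    <a aria-label="Next" href="?page=3">next</a>
  </nav>
</div>
"""


def find_next(root):
    """Equivalent of //*[@aria-label='Next'] (ElementTree needs a leading '.')."""
    return root.find(".//*[@aria-label='Next']")


def find_listing_links(root):
    """Emulates //a[contains(@aria-labelledby,'title')].

    ElementTree's XPath subset has no contains(), so the substring
    test on the attribute value is done in Python instead.
    """
    return [a for a in root.iter("a")
            if "title" in a.get("aria-labelledby", "")]


root = ET.fromstring(SAMPLE)

if __name__ == "__main__":
    print(find_next(root).get("href"))
    print([a.get("href") for a in find_listing_links(root)])
```

In this sample, the pagination XPath selects only the link labelled 'Next', and the Loop Item XPath selects the two listing links while skipping the help and navigation links.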
XPath plays an important role in locating the correct element in Octoparse. To learn more about it, please refer to the following tutorial:
The next page is loaded with AJAX, so we need to add an AJAX timeout to the "Click to Paginate" action.
- Click on Click to Paginate
- Go to the Options
- Tick Load with AJAX
- Set the AJAX timeout to 5-10s
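Conceptually, a timeout like this is a maximum wait: after the click, the page content changes without a full reload, so the tool has to wait some bounded time before moving on. The sketch below shows one generic way to express a bounded wait in Python by polling a condition. It only illustrates the idea and is not how Octoparse implements its AJAX timeout; wait_for and its parameters are hypothetical names.

```python
import time


def wait_for(condition, timeout=10.0, interval=0.5):
    """Poll condition() until it is true or timeout seconds have elapsed.

    Returns True if the condition became true in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    # Simulated page: the "new content" shows up on the third check.
    checks = {"count": 0}

    def content_loaded():
        checks["count"] += 1
        return checks["count"] >= 3

    print(wait_for(content_loaded, timeout=5.0, interval=0.1))
```

A longer timeout makes the run slower but more tolerant of slow page loads.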
7) Run your task - get data you want
- Click on the upper left side
- Select Run on your device to run the task on your computer, or select Run task in the Cloud to run the task in the Cloud (for premium users only)
Here is the sample output.
Author: The Octoparse Team