Scrape room listings data from Airbnb
Friday, September 28, 2018
In this tutorial, we show you how to scrape room listing information from Airbnb. Room details such as price, address, rating, and images can be collected by creating a scraping task in Octoparse. Travelers, homeowners, and researchers can use Octoparse to monitor changes in homestay rental prices and even predict future prices. Homeowners can also keep an eye on nearby competitors, which helps them set a reasonable price.
For this example, we select "New York, NY, United States", "Dec 1, 2018 - Dec 2, 2018", and "2 adults 1 children", and use the resulting URL for scraping.
You should adjust the search to your own needs and use your own Airbnb URL, since dates and locations vary. Furthermore, the structure and display of airbnb.com might vary depending on your IP address, preferred language, display screen, and even browser.
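Before pasting a search URL into Octoparse, it can help to check which search it actually encodes. The following is a minimal Python sketch, not part of Octoparse. The sample URL, its path, and the parameter names (checkin, checkout, adults, children) are assumptions made for illustration; copy the real URL from your browser's address bar after running your own search.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

# Hypothetical example URL; the path and parameter names are assumptions.
SAMPLE_URL = (
    "https://www.airbnb.com/s/New-York--NY--United-States/homes"
    "?checkin=2018-12-01&checkout=2018-12-02&adults=2&children=1"
)


def search_params(url):
    """Return the query-string parameters of a search URL as a dict."""
    return dict(parse_qsl(urlparse(url).query))


def with_params(url, **changes):
    """Return a copy of the URL with some query parameters replaced."""
    parts = urlparse(url)
    params = dict(parse_qsl(parts.query))
    params.update({key: str(value) for key, value in changes.items()})
    return urlunparse(parts._replace(query=urlencode(params)))


# Inspect the search encoded in the URL.
print(search_params(SAMPLE_URL))

# Derive a URL for a different weekend without retyping the search.
next_week = with_params(SAMPLE_URL, checkin="2018-12-08", checkout="2018-12-09")
print(next_week)
```

If the printed parameters do not match the search you intended, rerun the search in the browser and copy the URL again rather than editing it by hand.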
Here are the main steps in this tutorial: [Download task file here]
1) Go To Web Page - to open the target web page
· Create a task with "Advanced Mode"
Advanced Mode supports flexible configuration and can handle complex websites.
· Paste the URL into the "Extraction URL" box and click "Save URL" to move on.
2) Set Scroll Down - to load all items from one page
· Open "Workflow" in the top-right corner in Octoparse
We strongly suggest turning on "Workflow" mode so that you can review each step of your task and spot any mistakes in the steps.
· Check the box for "Scroll down to bottom of the page when finished loading"
· Set "Scroll times" as "1", "Interval" as "3", and "Scroll way" as "Scroll down to the bottom of the page"
Octoparse will automatically scroll down to the bottom of the page once (because we have set "Scroll times" as "1") and load more room listings before it starts extracting the detail information.
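The effect of the "Scroll times" and "Interval" settings can be pictured as a simple loop. The sketch below is only an illustration and is not how Octoparse is implemented. FakePage is a hypothetical stand-in for a page that loads more listings on each scroll, and treating the interval as a number of seconds is an assumption.

```python
import time


class FakePage:
    """Hypothetical stand-in for a results page that lazy-loads listings."""

    def __init__(self, loaded, per_scroll, total):
        self.loaded = loaded          # listings currently in the page
        self.per_scroll = per_scroll  # listings added by each scroll
        self.total = total            # listings available in all

    def scroll_to_bottom(self):
        # Each scroll loads more listings until none are left.
        self.loaded = min(self.total, self.loaded + self.per_scroll)


def scroll_down(page, scroll_times=1, interval=3, sleep=time.sleep):
    """Scroll to the bottom `scroll_times` times, waiting `interval` after each."""
    for _ in range(scroll_times):
        page.scroll_to_bottom()
        sleep(interval)  # give newly requested listings time to load
    return page.loaded


page = FakePage(loaded=6, per_scroll=12, total=18)
# Pass a no-op sleep so the demo finishes immediately.
print(scroll_down(page, scroll_times=1, interval=3, sleep=lambda seconds: None))
```

If a page needs more than one scroll to show all listings, raising the scroll count in this loop corresponds to raising "Scroll times" in the task.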
Infinite scrolling is another way a website loads more content, usually triggered by scrolling down or clicking "Load more". To learn how to deal with infinite scrolling in web scraping, see the related tutorial on dealing with infinite scrolling and "Load more".
3) Create a "Loop Item" - to click into each item on the list
· Click several room titles on the current page
Octoparse will automatically select all similar elements (the room titles, which link to the detail pages) on the current page.
· Click "Loop click each element" to create a loop item
When the task runs, Octoparse will click through each selected room link in the listings on the current page.
4) Extract data - to select the data for extraction
After you click "Loop click each element", Octoparse will click the first link and open the detail page.
· Click the data you need on the page, such as the title, location, price, and rating of the room, then click "Extract text of the selected element"
· When selecting the rating of the room, click "Extract button outer HTML" instead
· Type a new field name to rename a field if needed
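Steps 3 and 4 together amount to a loop over the listing links with a set of named fields extracted from each detail page. The Python sketch below illustrates that idea under stated assumptions and is not Octoparse's implementation. The detail pages, their markup, and the field paths are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical detail pages, keyed by the listing link that opens them.
DETAIL_PAGES = {
    "/rooms/1": '<div><h1 class="title">Cozy loft</h1>'
                '<span class="price">$120</span></div>',
    "/rooms/2": '<div><h1 class="title">Sunny studio</h1>'
                '<span class="price">$95</span></div>',
}

# Field name -> path of the element whose text is extracted (hypothetical).
FIELDS = {
    "title": './/h1[@class="title"]',
    "price": './/span[@class="price"]',
}


def extract_fields(page_source, fields):
    """Extract the text of each selected element from one detail page."""
    root = ET.fromstring(page_source)
    row = {}
    for name, path in fields.items():
        node = root.find(path)
        # A missing element yields None instead of stopping the run.
        row[name] = "".join(node.itertext()).strip() if node is not None else None
    return row


def loop_items(links, pages, fields):
    """Visit each link in turn and build one data row per listing."""
    return [{"link": link, **extract_fields(pages[link], fields)} for link in links]


for row in loop_items(["/rooms/1", "/rooms/2"], DETAIL_PAGES, FIELDS):
    print(row)
```

Each dictionary corresponds to one row of extracted data, and each key corresponds to a field name in the task.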
5) Customize data field using RegEx tool - to reformat the rating of the room (Optional)
In some cases, the data you need is hidden in the HTML (the source code of the web page) rather than displayed as visible text. Here, the rating of the room cannot be captured by extracting the text of the selected element. Therefore, we extract the outer HTML of the rating first, and then reformat it to trim the strings we don't need.
· Select "Rating" and click "Customize data field"
· Choose "Refine extracted data"
· Click "Add step" and choose "Match with Regular Expression"
· Choose "Try RegEx Tool"
· Check the box for "Start With" and enter "Rated "
· Check the box for "End With" and enter " out"
· Click "Generate" and "Match"
· Click "Apply" and "OK"
· Click "OK" to save
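The "Start With" and "End With" settings above describe a match for whatever sits between "Rated " and " out". A minimal Python sketch of an equivalent pattern follows. The exact expression that the RegEx Tool generates may differ, and the sample outer HTML is hypothetical.

```python
import re

# Match the shortest text that is preceded by "Rated " and followed by " out".
# This mirrors the "Start With" / "End With" settings; the expression that
# Octoparse generates may be written differently.
RATING_PATTERN = re.compile(r"(?<=Rated ).*?(?= out)")

# Hypothetical outer HTML of a rating element.
SAMPLE_OUTER_HTML = (
    '<span role="img" aria-label="Rated 4.5 out of 5">'
    '<span class="star"></span></span>'
)


def extract_rating(outer_html):
    """Return the text between "Rated " and " out", or None if absent."""
    match = RATING_PATTERN.search(outer_html)
    return match.group(0) if match else None


print(extract_rating(SAMPLE_OUTER_HTML))
```

Returning None for listings without a rating keeps a missing value distinguishable from a rating of zero.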
During a web scraping project, some data might not be in the format you want. Octoparse offers 8 data re-format options to further process or clean the extracted data into the right format. To learn more about reformatting data with regular expressions, see the related Octoparse tutorial.
6) Customize data field by modifying XPath - to improve the accuracy of the item list (Optional)
· Select "Loop Item" box
· Select "Variable list" and enter "//div[@class="_fhph4u"]/div//div[@class="_v72lrv"]//a"
In some situations, you need to modify the XPath so that Octoparse locates the list on the web page more reliably. In this tutorial, the room links must be located correctly so that all rooms are found after scrolling down. The new XPath "//div[@class="_fhph4u"]/div//div[@class="_v72lrv"]//a" locates the room listings through a variable list, and Octoparse uses it to find every listing link.
· Click "OK" to save
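To see what this XPath selects, you can try it on a small piece of markup. The sketch below uses Python's standard xml.etree.ElementTree module, which supports only a subset of XPath and requires a relative path, so the expression starts with ".//" instead of "//". The sample page is hypothetical; only the two class names come from this tutorial.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified results page. Only the two class names come from
# the tutorial; the rest of the markup is invented for the example.
RESULTS_PAGE = """
<html><body>
  <div class="_fhph4u">
    <div>
      <div class="_v72lrv"><a href="/rooms/1">Room A</a></div>
      <div class="_v72lrv"><a href="/rooms/2">Room B</a></div>
    </div>
    <div>
      <div class="_v72lrv"><a href="/rooms/3">Room C</a></div>
    </div>
  </div>
  <div class="footer"><a href="/help">Help</a></div>
</body></html>
"""

# Same expression as in the tutorial, made relative for ElementTree.
ROOM_LINKS_XPATH = './/div[@class="_fhph4u"]/div//div[@class="_v72lrv"]//a'


def room_links(page_source):
    """Return the href of every link matched by the room-links XPath."""
    root = ET.fromstring(page_source)
    return [link.get("href") for link in root.findall(ROOM_LINKS_XPATH)]


print(room_links(RESULTS_PAGE))
```

The link in the footer is not selected because it does not sit inside the two nested class containers, which is why the longer XPath is more precise than selecting every link on the page.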
If you want to learn more about XPath, see the related Octoparse tutorial.
7) Start extraction - to run the task and get data
· Click "Start Extraction" and select "Local Extraction" to start execution
Data will be automatically extracted by Octoparse.
· When the task is completed, you can export the extracted data for further analysis.
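As one example of further analysis, the sketch below computes the average nightly price from exported data. It assumes the data was exported as CSV. The column names and sample rows are hypothetical and depend on the field names you chose in the task.

```python
import csv
import io
from statistics import mean

# Hypothetical export; real column names depend on your field names.
EXPORTED_CSV = """Title,Price,Rating
Cozy loft,$120,4.5
Sunny studio,$95,4.8
Quiet room,$150,
"""


def parse_price(text):
    """Turn a price string such as "$1,200" into a float."""
    return float(text.replace("$", "").replace(",", ""))


def average_price(csv_text):
    """Average the Price column, skipping rows where the price is empty."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return mean(parse_price(row["Price"]) for row in rows if row["Price"])


print(round(average_price(EXPORTED_CSV), 2))
```

For a real export, replace io.StringIO(csv_text) with an opened file. Running the same calculation on exports from different dates gives a simple way to track price changes over time.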