Web Crawling Case Study | Crawling Data from Booking.com
Monday, May 15, 2017 8:29 AM
Welcome to Octoparse case tutorials! In this tutorial, I will walk through the steps of capturing hotel information from Booking.com.
Features covered in this tutorial: creating a list of items, pagination, XPath, and AJAX timeout.
Now let's get started!
Step 1. Set up basic information
- Click "Quick Start"
- Create a new task in the Advanced Mode
- Complete the basic information
Step 2. Navigate to the target website
- Enter the target URL in the built-in browser (copy the example URL here)
- Click the "Go" icon to open the webpage
Step 3. Create a list of items
Once the webpage finishes loading, notice that the hotel information blocks share a very similar layout. That means we can build a list of these blocks and then configure Octoparse to extract data from each one.
- Move your cursor over one of the similarly laid-out hotel blocks you want to extract information from
- When the highlighted section covers all the information of the first hotel, click it
- Click "Create a list of items"
- Click "Add current item to the list"
Now that the first item has been added to the list, we need to add the remaining items.
- Click "Continue to edit the list"
- Click the second section with similar layout
- Click "Add current item to the list" again
Now all the sections have been added to the list.
- Click "Finish Creating List"
- Click "Loop", which tells Octoparse to iterate through the list and extract data from each item
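The "Create a list of items" step above can be sketched in code. This is a rough, illustrative Python equivalent using only the standard library: gather every repeated hotel block that shares the same structure. The class name `sr_item` and the markup are hypothetical stand-ins; inspect the live page for the real selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical simplified markup standing in for Booking.com's results page.
SAMPLE = """
<html><body>
  <div class="sr_item"><span class="name">Hotel A</span></div>
  <div class="sr_item"><span class="name">Hotel B</span></div>
  <div class="sr_item"><span class="name">Hotel C</span></div>
</body></html>
"""

root = ET.fromstring(SAMPLE)
# ElementTree's limited XPath subset is enough to collect the repeated
# sections -- the code analogue of adding each similar block to the list.
hotel_blocks = root.findall(".//div[@class='sr_item']")
print(len(hotel_blocks))  # -> 3, one entry per hotel section
```

A real crawler would fetch the page over HTTP first; the loop body Octoparse applies to each list item corresponds to iterating over `hotel_blocks`.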
Step 4. Select the data to be extracted and rename data fields
In this step, we will extract data from the loop list of hotel sections. Navigate to the "Extract Data" action and click it; you will notice that the first section is outlined with a green dotted line. That means we will extract data only from within this section, following the steps below. Note that the extraction we set up for this section will apply to the rest of the list. Say we want to capture the hotel name, location, and rating.
- Click the hotel name
- Select "Extract text"
- Follow the same steps to extract the other data
- Rename any field if necessary
- Click "Save"
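The per-section extraction above has a straightforward code analogue: pull the name, location, and rating out of each block and keep them under the renamed field labels. Again, the tag and class names here are hypothetical placeholders, not Booking.com's real markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: each block carries the three fields we care about.
SAMPLE = """
<html><body>
  <div class="sr_item">
    <span class="name">Hotel A</span>
    <span class="location">Paris</span>
    <span class="rating">8.7</span>
  </div>
  <div class="sr_item">
    <span class="name">Hotel B</span>
    <span class="location">Lyon</span>
    <span class="rating">9.1</span>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE)
rows = []
for block in root.findall(".//div[@class='sr_item']"):
    # The dict keys play the role of the renamed data fields in Octoparse.
    rows.append({
        "hotel_name": block.findtext("span[@class='name']"),
        "location": block.findtext("span[@class='location']"),
        "rating": block.findtext("span[@class='rating']"),
    })
print(rows[0])
```

Just as in Octoparse, the extraction defined for the first block applies unchanged to every other block in the list.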
Step 5. Set up pagination
Now we need to flip through multiple web pages to extract as much data as possible, by setting up a pagination action.
- Click on "Next page" to the right of page numbers
- Choose "Loop Click Next Page"
This will tell Octoparse to open each page for more extraction actions.
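The "Loop Click Next Page" behavior boils down to: keep opening pages until no "Next page" link remains. Here is a minimal sketch of that loop; `fetch_page` is a stand-in for a real HTTP fetch and walks a canned list of pages so the control flow is clear.

```python
# Hypothetical canned pages; "next" points at the following page's index,
# or is None once the "Next page" link disappears.
pages = [
    {"hotels": ["Hotel A", "Hotel B"], "next": 1},
    {"hotels": ["Hotel C"], "next": 2},
    {"hotels": ["Hotel D"], "next": None},  # last page
]

def fetch_page(index):
    """Stand-in for downloading and parsing one results page."""
    return pages[index]

all_hotels = []
page_index = 0
while page_index is not None:
    page = fetch_page(page_index)
    all_hotels.extend(page["hotels"])   # extraction action runs per page
    page_index = page["next"]           # "click" Next page, or stop

print(all_hotels)  # -> ['Hotel A', 'Hotel B', 'Hotel C', 'Hotel D']
```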
Step 6. Modify XPath to locate "Next page"
In this step, we need to locate the pagination button manually by modifying its XPath, because the XPath of "Next page" differs from page to page (click here to learn more about XPath). Here we use the XPath tool in Octoparse to generate the XPath of "Next page" automatically.
- Open the XPath tool
- Click "Item Text" and input the text of the pagination button, which in this case is "Next page", into the box
- Click "Generate"
- Copy the generated XPath and paste it into the "Single element" text box
- Click "Save"
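Matching a button by its visible text, as the "Item Text" option does, can be demonstrated with a text-based XPath predicate. This sketch uses Python's built-in `xml.etree.ElementTree` (which supports a small XPath subset, including `[.='...']` for exact text matches); the pager markup is a hypothetical simplification.

```python
import xml.etree.ElementTree as ET

# Hypothetical pager markup with numbered links and a "Next page" button.
PAGER = """
<div class="pager">
  <a href="?page=1">1</a>
  <a href="?page=2">2</a>
  <a href="?page=2">Next page</a>
</div>
"""

root = ET.fromstring(PAGER)
# Select the link whose text equals the button label, independent of its
# position -- the same idea as generating an XPath from "Item Text".
next_link = root.find(".//a[.='Next page']")
print(next_link.get("href"))  # -> ?page=2
```

Full XPath engines also offer `//a[contains(text(), 'Next page')]` for partial matches, which is closer to what a generated XPath may look like.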
Step 7. Set AJAX Timeout
As this web page uses AJAX to load more pages, we need to set an AJAX timeout for the "Click to Paginate" action.
- Navigate to "Click to Paginate" action
- Tick "AJAX Load" checkbox
- Set an AJAX timeout of 15 seconds
- Click "Save"
Note: The AJAX timeout should be at least 10 seconds. A shorter timeout may cause Octoparse to stop waiting before the next page has finished loading and crawl the previous page again.
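The timeout's role can be illustrated with a simple polling loop: wait up to `timeout` seconds for the new content to appear, and give up afterwards. `content_ready` is a hypothetical check; a real crawler would test whether the next page's hotel blocks have rendered.

```python
import time

def wait_for_ajax(content_ready, timeout=15.0, poll=0.5):
    """Poll until content_ready() is True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if content_ready():
            return True
        time.sleep(poll)
    # Timed out: the crawler would now read whatever is on screen,
    # which may still be the previous page -- hence the 10+ second advice.
    return False

# Simulated page that only becomes ready on the third poll.
state = {"calls": 0}
def content_ready():
    state["calls"] += 1
    return state["calls"] >= 3

ready = wait_for_ajax(content_ready, timeout=5.0, poll=0.01)
print(ready)  # -> True
```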
Step 8. Start running your task
Now that the task is configured, it's time to run it and get the data we want.
- Click "Next"
- Click "Next"
- Click "Local Extraction"
There are two run modes: Local Extraction and Cloud Extraction (premium plan). With Local Extraction, the task runs on your own machine; with Cloud Extraction, the task runs on the Octoparse Cloud Platform, which means you can set it up, turn off your desktop or laptop, and the data will be extracted and saved to the cloud automatically. Features such as scheduled extraction, IP rotation, and API access are also supported in the Cloud. Find out more about Octoparse Cloud here.
Step 9. Check and export the data
After the extraction completes, we can check the extracted data, or click the "Export" button to export the results to an Excel file, a database, or other formats and save the file to your computer.
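As a minimal sketch of the export step, the rows collected earlier can be written out as CSV (a format Excel opens directly) with Python's standard `csv` module; the field names match the hypothetical renamed fields from Step 4.

```python
import csv
import io

# Rows as they might look after extraction (hypothetical sample data).
rows = [
    {"hotel_name": "Hotel A", "location": "Paris", "rating": "8.7"},
    {"hotel_name": "Hotel B", "location": "Lyon", "rating": "9.1"},
]

# Write to an in-memory buffer; swap in open("hotels.csv", "w", newline="")
# to save a real file.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["hotel_name", "location", "rating"])
writer.writeheader()
writer.writerows(rows)

print(buf.getvalue().splitlines()[0])  # -> hotel_name,location,rating
```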
To learn more about how to crawl data from a website, you can refer to these tutorials:
Author: The Octoparse Team
For more information about Octoparse, please click here.
Sign up today!