Web Scraping Case Study | Crawling flight information from ticket websites

Friday, May 12, 2017 2:43 AM

 

Welcome to our scraping case study!

In this tutorial, we will show you how to crawl flight information from a ticket website, Ctrip.com.

 

List of features covered 

  • Loop list
  • Data reformat
  • Data extraction

 

Now, let's get started!

 

Step 1. Set up basic information 

  • Click "Quick Start"
  • Create a new task in the Advanced Mode
  • Complete the basic information

 

 

 

Step 2. Navigate to the target website

First, open the webpage in Octoparse's built-in browser. 

 

 

Step 3. Create a list of items

Once the webpage finishes loading, notice how the data we want is neatly arranged in similar sections. Now we will build a loop list of all the similar sections so that we can capture specific data from each of them.

  • Click anywhere on the first section

Note: If the selection is not identified properly at first, click "Expand the selection area" until the targeted section is outlined properly.

 

 

Once done, continue to build the list. 

  • Click "Create a list of items" (sections with similar layout)
  • Click "Add current item to the list"

 

The first item has been added to the list. Next, we need to add the remaining items.

  • Click "Continue to edit the list"
  • Click a second section with similar layout
  • Click "Add current item to the list" again

Now, we have successfully added all similar sections to the list.

  • Click "Finish Creating List"
  • Click "loop". This tells Octoparse to loop through all sections on the list and capture data from each one.

 

Step 4. Select the data to be extracted and rename data fields

Click the first item, or any other item, from the list under "Loop Items". Notice that the selected section is now outlined.

Say we would like to capture the airline, the departure time, and so on.

  • Click on Airline
  • Select "Extract text"
  • Click on departure time
  • Select "Extract text"
  • Rename the data fields if necessary

Note: The extraction actions we set up now will also apply to every other section in the list.

 

  • Follow the same steps to capture the other data fields
  • Click "Save"
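
For readers who think in code, the loop-and-extract idea from Steps 3 and 4 can be sketched in Python. This is only an illustration of the concept, not how Octoparse works internally. The markup, the class names ("flight", "airline", "depart") and the placeholder values are all hypothetical and are not taken from Ctrip.com.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a results page.
# Class names and values are invented for illustration only.
SAMPLE = """
<div>
  <div class="flight">
    <span class="airline">Airline A</span>
    <span class="depart">08:05</span>
  </div>
  <div class="flight">
    <span class="airline">Airline B</span>
    <span class="depart">09:30</span>
  </div>
</div>
"""

def extract_flights(markup):
    """Loop over every similar section and pull the same fields from each."""
    root = ET.fromstring(markup)
    rows = []
    # The "loop list": every section that shares the same layout.
    for section in root.findall(".//div[@class='flight']"):
        # The "extract text" actions, applied identically to each section.
        rows.append({
            "Airline": section.findtext("span[@class='airline']"),
            "Departure time": section.findtext("span[@class='depart']"),
        })
    return rows

for row in extract_flights(SAMPLE):
    print(row)
```

The point of the sketch is that the fields are defined once, against a single section, and the loop then applies them to every section in the list.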

 

 

 

Step 5. Re-format extracted data

If we look closely at the sample data extracted for departure time and arrival time, the values are messy and contain too many blanks. To fix this, we need to re-format this data field.

  • Select the data field "Time" and click the "Customize Field" icon
  • Choose "Re-format extracted data"
  • Click "Add step"
  • Select "Replace with Regular Expression"
  • Input "\s+" for "Regular Expression" and a space for "Replace with" 
  • Once done, click "OK"

Now every run of redundant whitespace has been collapsed into a single space, and the data looks clean.
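
To see what this rule does outside of Octoparse, here is a minimal Python sketch of the same replacement: the pattern "\s+" replaced with a single space. The sample value is hypothetical, and the final strip() call is an extra step that is not part of the rule configured above.

```python
import re

def collapse_whitespace(value: str) -> str:
    """Replace each run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", value)

# Hypothetical raw value, padded the way a scraped "Time" field might be.
raw_time = "  08:05 \n      10:30  "

cleaned = collapse_whitespace(raw_time)
print(repr(cleaned))          # leading and trailing runs become single spaces
print(repr(cleaned.strip()))  # optional extra step, not part of the rule above
```

Note that the rule on its own leaves one space at each end of the value if the raw text started or ended with whitespace.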

  • Follow the same steps to re-format the other data fields

 

Once we have finished re-formatting all the data fields needed, click "Save".

 

Step 6. Start running your task

Now that we are done configuring the task, it's time to run it and get the data we want.

  • Select "Local Extraction"
  • Click "OK" to start

 

There are two options: Local Extraction and Cloud Extraction (premium plan). With Local Extraction, the task runs on your own machine. With Cloud Extraction, the task runs on the Octoparse Cloud platform, so you can set it up to run, turn off your desktop or laptop, and the data will be extracted and saved to the cloud automatically. Features such as scheduled extraction, IP rotation, and the API are also supported in the Cloud. Find out more about Octoparse Cloud here.

 

 

Step 7. Check the data and export

  • Check the data extracted
  • Click the "Export" button to export the results to an Excel file, a database, or another format, and save the file to your computer.

 

Done!

 

Good job completing this tutorial! Check out more related case studies:

Scraping Hotel Reviews from Tripadvisor.com

Web Scraping Hotel Information from Google Maps

Web Scraping Case Study | Scraping information from Capterra

 

 

Or learn more about how Octoparse can help you get the data you want:

Octoparse Smart Mode -- Get Data in Seconds

Re-format Captured Data (Add prefix, replace text, etc.) in Octoparse

Use Regular Expressions in Octoparse

Get Started with Octoparse in 2 Minutes

 

 

 

 

 

 
