Web Scraping Case Study | Crawling flight information from ticket websites
Friday, May 12, 2017 2:43 AM
Welcome to our scraping case study!
In this tutorial, we will show you how to crawl flight information from a ticket website, Ctrip.com.
List of features covered
- Loop list
- Data reformat
- Data extraction
Now, let's get started!
Step 1. Set up basic information
- Click "Quick Start"
- Create a new task in the Advanced Mode
- Complete the basic information
Step 2. Navigate to the target website
First, open the webpage in Octoparse's built-in browser.
- Enter the target URL in the built-in browser (URL of the example: http://www.ctrip.com.hk/chinaflights/shanghai-to-chengdu/tickets-sha-ctu/?flighttype=s&dcity=sha&acity=ctu&relddate=15&startdate=2017-05-26&startday=fri&relweek=2&searchboxArg=t)
- Click the "Go" icon to open the webpage
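The example URL carries the search criteria in its query string, so it can help to see what is in it before changing the route or date. The snippet below is a minimal sketch using only Python's standard library; it is not part of Octoparse, the helper name `search_params` is made up for illustration, and the reading of `dcity` and `acity` as departure and arrival city codes is an assumption based on the URL itself.

```python
from urllib.parse import urlsplit, parse_qs

# The example URL from this step, split across lines for readability.
URL = (
    "http://www.ctrip.com.hk/chinaflights/shanghai-to-chengdu/tickets-sha-ctu/"
    "?flighttype=s&dcity=sha&acity=ctu&relddate=15&startdate=2017-05-26"
    "&startday=fri&relweek=2&searchboxArg=t"
)


def search_params(url: str) -> dict:
    """Return the query-string parameters of a URL as a flat dict (hypothetical helper)."""
    query = urlsplit(url).query
    # parse_qs maps each key to a list of values; keep the first value of each.
    return {key: values[0] for key, values in parse_qs(query).items()}


params = search_params(URL)
# Assumption: dcity/acity look like departure/arrival city codes.
print(params["dcity"], params["acity"], params["startdate"])
```

If these parameters behave the way their names suggest, editing `startdate` in the URL would be one way to point the same task at a different day, but that should be checked against the live site.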
Step 3. Create a list of items
Once the webpage finishes loading, notice that the data we want is arranged in similar sections. We will now build a loop list of all these sections so that we can capture specific data from each of them.
- Click anywhere on the first section
Note: If the selection was not identified properly at first, click "Expand the selection area" until the target section is outlined correctly.
Once done, continue to build the list.
- Click "Create a list of items" (sections with similar layout)
- Click "Add current item to the list"
The first item has been added to the list. Next, we need to add the remaining items.
- Click "Continue to edit the list"
- Click a second section with similar layout
- Click "Add current item to the list" again
Now, we have successfully added all similar sections to the list.
- Click "Finish Creating List"
- Click "loop". This tells Octoparse to loop through all sections in the list and capture data from each one.
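Conceptually, a loop list is the set of page elements that share the same layout. The sketch below illustrates that idea in Python on a small, made-up snippet of well-formed markup; the element names and the `flight-item` class are hypothetical and do not reflect Ctrip's real page structure or how Octoparse identifies sections internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not the real Ctrip page).
SAMPLE_PAGE = """
<div id="results">
  <div class="flight-item"><span class="airline">Airline A</span></div>
  <div class="flight-item"><span class="airline">Airline B</span></div>
  <div class="ad-banner">Unrelated content</div>
  <div class="flight-item"><span class="airline">Airline C</span></div>
</div>
""".strip()


def build_loop_list(page: str, item_class: str) -> list:
    """Collect every <div> whose class matches item_class (hypothetical helper)."""
    root = ET.fromstring(page)
    return root.findall(f".//div[@class='{item_class}']")


items = build_loop_list(SAMPLE_PAGE, "flight-item")
# Loop through the list and handle one section at a time.
for index, item in enumerate(items, start=1):
    print(index, item.find("span").text)
```

The banner is skipped because it does not share the layout of the flight sections, which is the same reason only similar sections should be added to the list in this step.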
Step 4. Select the data to be extracted and rename data fields
Click the first item, or any other item, in the list under "Loop Items". Notice that the selected section is now outlined.
Say we would like to capture the airline, the departure time, and so on.
- Click on Airline
- Select "Extract text"
- Click on departure time
- Select "Extract text"
- Rename the data fields if necessary
Note that the extraction actions we set up here will also be applied to every other section in the list.
- Follow the same steps to capture the other data fields
- Click "Save"
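The idea behind this step is that one field definition, set up on a single section, is reused for every section in the list. Here is a rough Python sketch of that idea; the sample markup, the class names, and the `extract_fields` helper are all hypothetical and only stand in for what Octoparse does through its point-and-click interface.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for two sections; the real page will differ.
SAMPLE_SECTIONS = """
<div>
  <div class="flight-item">
    <span class="airline">Airline A</span>
    <span class="depart">07:30   PVG T2</span>
  </div>
  <div class="flight-item">
    <span class="airline">Airline B</span>
    <span class="depart">09:15   SHA T2</span>
  </div>
</div>
""".strip()

# Data field name -> class of the element holding its text (assumed names).
FIELDS = {"Airline": "airline", "Departure time": "depart"}


def extract_fields(section: ET.Element, fields: dict) -> dict:
    """Apply the same "Extract text" rule for each field to one section."""
    record = {}
    for name, css_class in fields.items():
        node = section.find(f".//span[@class='{css_class}']")
        # Missing fields are left empty rather than raising an error.
        record[name] = "".join(node.itertext()) if node is not None else ""
    return record


root = ET.fromstring(SAMPLE_SECTIONS)
records = [extract_fields(s, FIELDS) for s in root.findall("div")]
print(records)
```

The departure time values are kept exactly as found, extra blanks included.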
Step 5. Re-format extracted data
If we look closely at the sample data extracted for departure time and arrival time, it is obvious that the format is a bit messy with too many blanks. To fix this, we need to reformat this data field.
- Select the data field "Time" and click the "Customize Field" icon
- Choose "Re-format extracted data"
- Click "Add step"
- Select "Replace with Regular Expression"
- Input "\s+" for "Regular Expression" and a space for "Replace with"
- Once done, click "OK"
Now, all redundant spaces should have been removed and the data now looks just right.
- Follow the same steps to re-format the other data fields
Once we have finished re-formatting all the data fields needed, click "Save".
Step 6. Start running your task
Now that we are done configuring the task, it's time to run it and get the data we want.
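For readers who want to check the effect of the "\s+" rule outside Octoparse, the same replacement can be tried with Python's `re` module. This is a minimal sketch: the sample string is made up, and Python's regular expression engine may differ in small details from the one Octoparse uses.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", text)


# Made-up sample resembling a messy extracted time field.
raw = "07:30 \n      PVG  T2"
print(collapse_whitespace(raw))  # prints "07:30 PVG T2"
```

Note that a run of whitespace at the very start or end of the value is reduced to a single space rather than removed; in Python, calling `.strip()` on the result would drop it.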
- Select "Local Extraction"
- Click "OK" to start
Octoparse offers two modes: Local Extraction and Cloud Extraction (premium plan). With Local Extraction, the task runs on your own machine. With Cloud Extraction, the task runs on the Octoparse Cloud platform, so you can set it up, turn off your desktop or laptop, and the data will be extracted and saved to the cloud automatically. The Cloud also supports features such as scheduled extraction, IP rotation, and an API.
Step 7. Check the data and export
- Check the data extracted
- Click the "Export" button to export the results to an Excel file, a database, or another format, and save the file to your computer.
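To give a sense of what an exported file contains, the sketch below writes a couple of records as CSV text with Python's standard `csv` module. It is only an illustration: the records and the `export_csv` helper are made up, and Octoparse's own export is done through the "Export" button, not through this code.

```python
import csv
import io


def export_csv(records: list, fieldnames: list) -> str:
    """Serialize a list of dict records to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()       # first row: the data field names
    writer.writerows(records)  # one row per extracted section
    return buffer.getvalue()


# Made-up records standing in for the extracted flight data.
sample_records = [
    {"Airline": "Airline A", "Departure time": "07:30 PVG T2"},
    {"Airline": "Airline B", "Departure time": "09:15 SHA T2"},
]
print(export_csv(sample_records, ["Airline", "Departure time"]))
```

Each data field becomes a column and each section in the loop list becomes a row, which is a useful thing to verify when checking the data before export.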
Good job on completing this tutorial!