Web Scraping Case Study | Scraping training course information from sulekha.com
Thursday, May 18, 2017 3:23 AM
Welcome to the Octoparse Web Scraping Case Study.
In this tutorial, we will show you how Octoparse can be used to capture training course information from the online training website Sulekha.com.
Features covered
- Build URL list
- Set up pagination
- Build a loop list
- Modify XPath
Now, let's get started!
Step 1. Create a new task
- Click "Quick Start"
- Create a new task (Advanced Mode)
- Complete the basic information
Step 2. Navigate to the target website
- Enter the target URL in the built-in browser (URL example: https://techjobs.sulekha.com/itcourses)
- Click the "Go" icon to open the webpage
Step 3. Create a list of items
Once the webpage has finished loading, you can see that the course information is arranged in a repeating layout, so we can build a loop list to scrape all of it.
- Move your cursor over the item from which you want to extract data (the course title in this case) and click it
- Select "Create a list of items"
- Select "Add current item to the list"
The first item has now been added to the list. Next, we need to add another similar item. Adding two similar items to the list trains Octoparse to identify all similar items on the webpage and add them to the list automatically.
- Select "Continue to edit the list"
- Click another item (course title)
- Select "Add current item to the list" again
All similar items are now on the list.
- Select "Finish creating the list"
- Click "loop"
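One way to picture how two sample items can stand in for a whole list is sketched below in Python. This is only an illustration of the general idea, not Octoparse's actual algorithm: the `generalize` helper and both element paths are hypothetical. It merges two indexed element paths by keeping the steps that agree and dropping the index where they differ.

```python
import re

# One path step such as "li[2]" or "h3": a tag name plus an optional index.
STEP = re.compile(r"^(?P<tag>[^\[\]]+)(?:\[(?P<index>\d+)\])?$")


def generalize(path_a: str, path_b: str) -> str:
    """Merge two indexed element paths into one pattern (illustrative only)."""
    steps_a = path_a.strip("/").split("/")
    steps_b = path_b.strip("/").split("/")
    if len(steps_a) != len(steps_b):
        raise ValueError("paths have different depths")

    merged = []
    for a, b in zip(steps_a, steps_b):
        match_a, match_b = STEP.match(a), STEP.match(b)
        if match_a is None or match_b is None or match_a["tag"] != match_b["tag"]:
            raise ValueError(f"cannot merge steps {a!r} and {b!r}")
        # Identical steps are kept; differing indices are dropped so the
        # step matches every sibling with that tag.
        merged.append(a if a == b else match_a["tag"])
    return "/" + "/".join(merged)


# Hypothetical paths of the two course titles that were clicked.
first = "/html/body/div[1]/ul/li[1]/h3/a"
second = "/html/body/div[1]/ul/li[2]/h3/a"
pattern = generalize(first, second)
print(pattern)
```

In this toy model, two samples taken from the same block keep that block's index (`div[1]`) fixed, so the resulting pattern would only match items inside that one block.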
Step 4. Modify XPath to locate all course items
After creating the loop list, you may find that only the training courses under the first category, "./NET/ASP/VB/C Sharp", were added to the loop list. To include all training courses available on the webpage, we need to modify the XPath of the variable list.
- Inspect the course sections in Firepath
Modify the XPath generated in the Variable list to get all the training courses outlined in the dashed box.
- Modify and locate all of the course items in Firepath
- Find the proper XPath: .//h3[@section='tjitcptrainingmodules']/a
- Go back to Octoparse and navigate to the "Cycle Pages" action
- Copy the modified XPath .//h3[@section='tjitcptrainingmodules']/a from Firepath and paste it into the "Variable list" text box
- Click "Save"
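To see why an attribute-based XPath picks up more items than a position-based one, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset. The markup is a made-up, well-formed stand-in for the real page: the `div` wrappers, link targets, and course titles are assumptions, and only the `section` attribute value comes from the XPath above. The narrow path is a hypothetical example of an XPath tied to the first category.

```python
import xml.etree.ElementTree as ET

# Made-up markup standing in for the course listing (structure is assumed).
PAGE = """
<html><body>
  <div class="category">
    <h3 section="tjitcptrainingmodules"><a href="/dotnet">.NET Training</a></h3>
    <h3 section="tjitcptrainingmodules"><a href="/csharp">C Sharp Training</a></h3>
  </div>
  <div class="category">
    <h3 section="tjitcptrainingmodules"><a href="/java">Java Training</a></h3>
  </div>
  <h3 section="other"><a href="/about">About</a></h3>
</body></html>
"""

root = ET.fromstring(PAGE.strip())

# Hypothetical position-based path: only reaches links in the first category.
narrow = root.findall("./body/div[1]/h3/a")

# Attribute-based path from the tutorial: matches course links anywhere.
broad = root.findall(".//h3[@section='tjitcptrainingmodules']/a")

narrow_titles = [a.text for a in narrow]
broad_titles = [a.text for a in broad]
print(len(narrow_titles), "items with the narrow path")
print(len(broad_titles), "items with the broad path")
```

Real-world HTML is often not well-formed XML, so outside of a sketch like this a dedicated HTML parser is usually the safer choice.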
Step 5. Extract data and rename the data fields
Since we only need the course titles, there is no need to navigate into the detailed information page.
However, because we selected "Loop" in Step 3, Octoparse automatically selected the first item of the list and navigated to its detail page. Thus, we need to delete the "Click Item" action and extract the course titles directly from the list.
- Click "Click Item"
- Select "Delete"
- Then, drag the "Extract Data" action into the "Loop Item" action
- Click the "Extract Data" action
- Click the first course title
- Select "Extract text"
- Rename the field name if necessary
- Click "Save"
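The "Loop Item" plus "Extract Data" combination behaves roughly like a for-loop that reads the text of each matched element into a named field. The Python sketch below illustrates that idea with the standard library only; the `extract_text` helper, the sample markup, and the field names are hypothetical and are not part of Octoparse.

```python
import xml.etree.ElementTree as ET


def extract_text(markup: str, item_xpath: str, field_name: str = "Field1"):
    """Return one {field_name: text} row per element matched by item_xpath."""
    root = ET.fromstring(markup.strip())
    rows = []
    for item in root.findall(item_xpath):            # the "Loop Item" part
        text = "".join(item.itertext()).strip()      # the "Extract text" part
        rows.append({field_name: text})              # the (renamed) data field
    return rows


# Made-up, well-formed sample markup (an assumption, not the real page).
SAMPLE = """
<div>
  <h3 section="tjitcptrainingmodules"><a href="/java"> Java <b>Training</b> </a></h3>
  <h3 section="tjitcptrainingmodules"><a href="/python">Python Training</a></h3>
</div>
"""

rows = extract_text(
    SAMPLE,
    ".//h3[@section='tjitcptrainingmodules']/a",
    field_name="course_title",  # renamed from the placeholder default
)
print(rows)
```

Using `itertext()` rather than `.text` collects text from nested tags as well, which matters when a title contains inline markup.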
Step 6. Run your task
The task is ready, so we can start extracting data now.
- Select "Local Extraction"
- Double click to start
There are two options: Local Extraction and Cloud Extraction (premium plan). With a local extraction, the task runs on your local device. With a cloud extraction, the task runs on the Octoparse Cloud platform without using local resources, and the data is automatically extracted and saved to the cloud. Features such as scheduled extraction, IP rotation, and API access are also supported in the Cloud. Find out more about Octoparse Cloud here.
Step 7. Check the extracted data and export
You can check the data extraction progress in the data pane.
After the task is done, click the "Export" button to export the extracted data to a file (CSV, XLS, etc.) or to a database.
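For readers who post-process the results in code, a CSV export of this kind of data can be reproduced with Python's built-in `csv` module. This is a sketch under stated assumptions: the `rows_to_csv` helper, the `course_title` column name, and the sample rows are hypothetical, and it does not describe the exact file layout Octoparse produces.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dict rows to CSV text with a header line."""
    buffer = io.StringIO(newline="")  # let the csv module control line endings
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up rows standing in for extracted course titles.
rows = [
    {"course_title": "Java Training"},
    {"course_title": "Python, Django Training"},
]
csv_text = rows_to_csv(rows, fieldnames=["course_title"])
print(csv_text)
```

Values that contain commas are quoted automatically by the writer. To save the result, the text can be written to a file opened with `newline=""` so the line endings are not translated a second time.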
Well done! You have finished the extraction task!
Keep in mind that all actions in the workflow are linked to each other, so a tiny mistake or change can lead to a very different result. Please be patient and careful.
We are here to help (support@octoparse.com), or join our Facebook group, Octoparse Community (https://www.facebook.com/groups/1700643603550408/).
Now check out similar case studies:
Scrape AJAX Pages from The Washington Post
Scrape Article Information from Google Scholar
Or learn more about XPath:
Modify XPath Manually in Octoparse
Web scraping | Introduction to Octoparse XPath Tool
Add Relative Xpath With Customization in Octoparse
Author: The Octoparse Team