Web Scraping Case Study | Scraping training course information from sulekha.com
Thursday, May 18, 2017 3:23 AM
Welcome to Octoparse Web Scraping Case Study.
In this tutorial, we will show you how to use Octoparse to capture training course information from the online training website Sulekha.com.
Now, let's get started!
Step 1. Create a new task
Step 2. Navigate to the target website
Step 3. Create a list of items
Once the webpage has finished loading, you can see that the course information is arranged in a very similar layout, so we can build a loop list to scrape all of it.
Now the first section has been added to the list successfully. Next, we need to add another similar item. Adding two similar items to the list trains Octoparse to identify all similar items on the webpage and add them to the list automatically.
All similar items are now on the list.
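The idea of learning a pattern from two examples can be sketched in a few lines of Python. This is only an illustration of the general principle, not Octoparse's actual algorithm, and the two XPaths and the `generalize` helper are hypothetical:

```python
import re

def generalize(xpath_a: str, xpath_b: str) -> str:
    """Merge two XPaths of equal depth by dropping any positional index that differs."""
    steps_a = xpath_a.strip("/").split("/")
    steps_b = xpath_b.strip("/").split("/")
    if len(steps_a) != len(steps_b):
        raise ValueError("paths have different depths")
    merged = []
    for a, b in zip(steps_a, steps_b):
        if a == b:
            # Identical step: keep it, index included.
            merged.append(a)
            continue
        # Steps differ: they must be the same tag with different indices.
        tag_a = re.sub(r"\[\d+\]$", "", a)
        tag_b = re.sub(r"\[\d+\]$", "", b)
        if tag_a != tag_b:
            raise ValueError(f"cannot merge steps {a!r} and {b!r}")
        merged.append(tag_a)
    return "/" + "/".join(merged)

# Hypothetical XPaths of the first and second items picked on the page.
first = "/html/body/div[2]/ul/li[1]/a"
second = "/html/body/div[2]/ul/li[2]/a"
pattern = generalize(first, second)
print(pattern)
```

Note that an index shared by both examples (`div[2]` here) survives in the merged pattern, so a pattern learned from two items in the same block may stay tied to that block.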
Step 4. Modify XPath to locate all course items
After creating the loop list, you may find that only the training courses under the first category, "./NET/ASP/VB/C Sharp", were added to it. To include all training courses available on the webpage, we need to modify the XPath of the Variable list.
Modify the XPath generated in the Variable list to get all the training courses outlined in the dashed box.
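To see why loosening an XPath picks up the remaining categories, here is a minimal sketch using Python's standard library. The markup below is invented for illustration, since the real structure of the Sulekha page and the XPath Octoparse generates are not shown in this tutorial:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: two course categories, each with its own list of courses.
PAGE = """
<div id="courses">
  <div class="category">
    <h3>Category A</h3>
    <ul>
      <li><a href="/a1">Course A1</a></li>
      <li><a href="/a2">Course A2</a></li>
    </ul>
  </div>
  <div class="category">
    <h3>Category B</h3>
    <ul>
      <li><a href="/b1">Course B1</a></li>
    </ul>
  </div>
</div>
"""

root = ET.fromstring(PAGE)

# Over-specific path: the positional index [1] pins the match to the first category.
narrow = root.findall("./div[1]/ul/li/a")
# Generalized path: without the index, the same step matches every category.
broad = root.findall("./div/ul/li/a")

narrow_titles = [a.text for a in narrow]
broad_titles = [a.text for a in broad]
print(narrow_titles)
print(broad_titles)
```

The only difference between the two paths is the dropped index on the category container, which is the kind of change this step asks for.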
Step 5. Extract data and rename the data fields
Since we only need the course titles, there is no need to navigate into the detailed information page.
However, when we selected "Loop" in step 3, Octoparse automatically selected the first item on the list and navigated to its detail page. We therefore need to delete the "Click Item" action and then extract the course titles directly from the list.
Step 6. Run your task
The task is ready, so we can start extracting data now.
Octoparse offers two ways to run a task: Local Extraction and Cloud Extraction (premium plan). With Local Extraction, the task is executed on your local device. With Cloud Extraction, the task is executed on the Octoparse Cloud platform without using local resources, and the data is automatically extracted and saved to the cloud. The Cloud also supports features such as scheduled extraction, IP rotation, and API access.
Step 7. Check the extracted data and export
You can check the data extraction progress in the pane.
After the task is done, click the "Export" button to export the extracted data in the format of your choice (CSV, XLS, etc.) or to a database.
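Once the data is exported as CSV, a quick sanity check outside Octoparse can be useful. The sketch below assumes a single column named `course_title`, which is a hypothetical field name (use whatever you chose in step 5), and it reads from an in-memory string so that it runs as is:

```python
import csv
import io

# Stand-in for an exported file; the column name "course_title" is hypothetical.
EXPORTED = io.StringIO(
    "course_title\n"
    "Course A1\n"
    "Course A2 \n"
    "\n"
    "Course A1\n"
)

seen = set()
titles = []
for row in csv.DictReader(EXPORTED):
    # Trim stray whitespace, then skip empty values and duplicates.
    title = (row["course_title"] or "").strip()
    if title and title not in seen:
        seen.add(title)
        titles.append(title)

print(titles)
```

For a real export, replace the `io.StringIO` object with `open(path, newline="", encoding="utf-8")`, where `path` points to your exported file.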
Well done! You have finished the extraction task!
Keep in mind that all execution actions are linked to one another, so a tiny mistake or change can lead to a very different result. Please be patient and careful.
We are here to help: email firstname.lastname@example.org or join our Facebook group, Octoparse Community (https://www.facebook.com/groups/1700643603550408/).
Author: The Octoparse Team