Web Scraping Case Study | Scraping training course information from sulekha.com

Thursday, May 18, 2017 3:23 AM

 

 

Welcome to this Octoparse web scraping case study.

In this tutorial, we will show you how to use Octoparse to capture training course information from an online training website, Sulekha.com.

 

Features covered 

  • Build URL list
  • Set up pagination
  • Build a loop list
  • Modify XPath

 

Now, let's get started!

 

 

Step 1. Create a new task

  • Click "Quick Start"
  • Create a new task (Advanced Mode)
  • Complete the basic information

 

 

Step 2. Navigate to the target website

 

 

 

Step 3. Create a list of items

Once the webpage has finished loading, you can see that the course entries all share a similar layout, so we can build a loop list to scrape them all.

  • Move your cursor over the item from which you want to extract data (the course title in this case) and click it
  • Select "Create a list of items" 
  • Select "Add current item to the list"

 

Now the first item has been added to the list. Next, we need to add another similar item. Adding two similar items to the list trains Octoparse to identify all similar items on the webpage and add them to the list automatically.

  • Select "Continue to edit the list"
  • Click another item (a second course title)
  • Select "Add current item to the list" again

All similar items are now on the list.

  • Select "Finish creating the list"
  • Click "Loop"
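The tutorial does not describe how Octoparse generalizes from two clicked items, so the following is only a rough intuition, not Octoparse's actual algorithm. It sketches in Python one plausible approach: compare the element paths of the two examples and drop the positional index wherever they differ. The function name `generalize` and both sample paths are hypothetical.

```python
import re


def generalize(path_a: str, path_b: str) -> str:
    """Merge two absolute XPaths into one pattern that covers both.

    Steps where the paths agree are kept unchanged. Where only the
    positional index differs, the index is dropped so that the step
    matches every sibling with that tag.
    """
    steps_a = path_a.strip("/").split("/")
    steps_b = path_b.strip("/").split("/")
    if len(steps_a) != len(steps_b):
        raise ValueError("paths have different depths")

    merged = []
    for a, b in zip(steps_a, steps_b):
        if a == b:
            merged.append(a)
            continue
        # Strip a trailing positional predicate such as [3].
        tag_a = re.sub(r"\[\d+\]$", "", a)
        tag_b = re.sub(r"\[\d+\]$", "", b)
        if tag_a != tag_b:
            raise ValueError(f"steps differ in tag: {a} vs {b}")
        merged.append(tag_a)
    return "/" + "/".join(merged)


# Hypothetical paths for the two course titles clicked in this step.
first = "/html/body/div[2]/ul/li[1]/h3/a"
second = "/html/body/div[2]/ul/li[4]/h3/a"
pattern = generalize(first, second)
print(pattern)
```

In this sketch the shared `div[2]` step survives in the merged pattern. That is one way a generated pattern could stay confined to a single container on the page, which may be why the list sometimes needs a manual XPath adjustment afterwards.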

 

 

Step 4. Modify XPath to locate all course items

After creating the loop list, you may find that only the training courses under the first category, ".NET/ASP/VB/C Sharp", were added to the loop list. To include all training courses available on the webpage, we need to modify the XPath of the variable list.

  • Inspect the course sections in Firepath

 

 

 

Modify the XPath generated in the Variable list to get all the training courses outlined in the dashed box.

  • Adjust the XPath in Firepath until it matches all of the course items
  • The XPath that matches them all is .//h3[@section='tjitcptrainingmodules']/a

 

 

  • Go back to Octoparse and navigate to the "Cycle Pages" action
  • Copy the modified XPath .//h3[@section='tjitcptrainingmodules']/a from Firepath, and paste it in the "Variable list" text box
  • Click "Save"
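To see why the modified XPath picks up courses from every category, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`, which supports the subset of XPath needed here. The markup, the `id` values, and the course names are hypothetical and much simpler than the real page. The narrow XPath is also an assumption, since the tutorial does not show the expression Octoparse originally generated.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: two categories, each listing courses.
PAGE = """
<html><body>
  <div id="cat1">
    <h3 section="tjitcptrainingmodules"><a href="/c1">ASP.NET Training</a></h3>
    <h3 section="tjitcptrainingmodules"><a href="/c2">C Sharp Training</a></h3>
  </div>
  <div id="cat2">
    <h3 section="tjitcptrainingmodules"><a href="/c3">Java Training</a></h3>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE)

# Hypothetical XPath tied to the first category container only.
narrow = root.findall(".//div[@id='cat1']/h3/a")

# XPath from the tutorial: any h3 with the section attribute, then its link.
broad = root.findall(".//h3[@section='tjitcptrainingmodules']/a")

print(len(narrow), len(broad))
print([a.text for a in broad])
```

Because the tutorial's expression keys on the `section` attribute rather than on a particular container, it matches the course links wherever they sit in the page.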

 

 

Step 5. Extract data and rename the data fields

Since we only need the course titles, there is no need to navigate into the detailed information page.

However, when we selected "Loop" in Step 3, Octoparse automatically selected the first item of the list and navigated to its detail page. We therefore need to delete the "Click Item" action and extract the course titles directly from the list.

  • Click "Click Item"
  • Select "Delete"

 

  • Then, drag the "Extract Data" action into the "Loop Item" action
  • Click the "Extract Data" action 

 

  • Click the first course title
  • Select "Extract text"
  • Rename the field name if necessary
  • Click "Save"
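The loop-and-extract logic configured above can be pictured roughly as follows. This is a minimal Python sketch under stated assumptions, not what Octoparse runs internally: the markup fragment is invented, and `course_title` is a hypothetical field name standing in for whatever name you choose when renaming the field.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment standing in for the course list.
FRAGMENT = """
<div>
  <h3 section="tjitcptrainingmodules"><a href="/c1"> ASP.NET Training </a></h3>
  <h3 section="tjitcptrainingmodules"><a href="/c2">Java <b>Advanced</b> Training</a></h3>
</div>
"""

FIELD_NAME = "course_title"  # hypothetical field name chosen by the user

root = ET.fromstring(FRAGMENT)
rows = []
for link in root.findall(".//h3[@section='tjitcptrainingmodules']/a"):
    # itertext() gathers text from the element and any nested tags;
    # split/join then collapses runs of whitespace.
    text = " ".join("".join(link.itertext()).split())
    rows.append({FIELD_NAME: text})

print(rows)
```

Each loop iteration reads the text of one list item without following its link, which mirrors removing the "Click Item" action and extracting directly inside the loop.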

 

 

Step 6. Run your task 

The task is ready, so we can start extracting data.

  • Select "Local Extraction"
  • Double click to start

 

Octoparse offers two modes: Local Extraction and Cloud Extraction (premium plan). With Local Extraction, the task runs on your own device. With Cloud Extraction, the task runs on the Octoparse Cloud platform without using local resources, and the extracted data is saved to the cloud automatically. The Cloud also supports features such as scheduled extraction, IP rotation, and API access.

 

 

Step 7. Check the extracted data and export

You can check the data extraction progress in the pane.

After the task is done, click the "Export" button to export the extracted data to a file format such as CSV or XLS, or to a database.
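For readers who want to picture what a CSV export of this task contains, here is a small Python sketch using the standard `csv` module. The rows, the `course_title` field name, and the `to_csv` helper are hypothetical; the actual file produced by Octoparse may differ in details such as encoding or column order.

```python
import csv
import io

# Hypothetical rows, shaped like the output of the extraction step.
rows = [
    {"course_title": "ASP.NET Training"},
    {"course_title": "Java Training"},
]


def to_csv(records, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv(rows, ["course_title"])
print(csv_text)
```

The header row carries the field name set in Step 5, followed by one line per extracted course title.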

 

 

Well done! You have finished the extraction task!

Keep in mind that all actions in a task are linked to each other, so a tiny mistake or change can lead to a very different result. Please be patient and careful when building the workflow.

 

We are here to help: email support@octoparse.com or join our Facebook group, Octoparse Community (https://www.facebook.com/groups/1700643603550408/).

 

Now check out similar case studies:

Scrape AJAX Pages from The Washington Post

Scrape Article Information from Google Scholar

How to Scrape Wordpress Posts

 

Or learn more about XPath:

Modify XPath Manually in Octoparse

Web scraping | Introduction to Octoparse XPath Tool

How to Use XPath in Octoparse

Add Relative Xpath With Customization in Octoparse

 

 

Author: The Octoparse Team


 

 
