Scraping job postings from Glassdoor.com

Wednesday, May 17, 2017 10:43 AM

 

Welcome to the Octoparse web scraping tutorial!

In this tutorial, we will show you how to scrape job postings from the recruitment website Glassdoor.com.

 

Features covered

  • Build URL list
  • Set up pagination
  • Build a loop list

 

Now, let's get started!

 

Step 1. Set up basic information

  • Click "Quick Start"
  • Create a new task (Advanced Mode)
  • Complete the basic information

 

 

Step 2. Create a loop for a list of URLs

On glassdoor.com, job postings are classified by region. To capture information from multiple regions, we first need to note the URL for each region we want data for, then create an extraction task that runs through the list of URLs.

For this example, we'll use the URLs below:

https://www.glassdoor.com/Job/pittsburgh-software-engineer-jobs-SRCH_IL.0,10_IC1152990_KO11,28.htm

https://www.glassdoor.com/Job/washington-software-engineer-jobs-SRCH_IL.0,10_IC1138213_KO11,28.htm

https://www.glassdoor.com/Job/jobs.htm?suggestCount=0&suggestChosen=false&clickSource=searchBtn&typedKeyword=software+engineer&sc.keyword=software+engineer&locT=S&locId=2280&jobType=

 

  • Drag a "Loop Item" into the Workflow Designer

 

  • Choose "URL list" in the "Loop mode"
  • Paste the list of URLs into the "URL list" box
  • Click "Save"
  • Drag a "Go to web page" action into the Loop action
  • Check "Open the URL in loop item"
  • Adjust "Timeout" if necessary
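
For readers who think in code, the URL-list loop above can be pictured roughly as follows. This is only an illustrative Python sketch, not how Octoparse works internally: the helper name `parse_url_list` and the one-URL-per-line rule are assumptions made for the example, and nothing is actually downloaded.

```python
from urllib.parse import urlsplit


def parse_url_list(pasted_text):
    """Split pasted text into a clean list of URLs, one per line.

    Hypothetical helper: blank lines are skipped, duplicates are dropped,
    and anything that is not an http(s) URL is rejected.
    """
    urls = []
    seen = set()
    for line in pasted_text.splitlines():
        url = line.strip()
        if not url or url in seen:
            continue
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not a valid URL: {url!r}")
        seen.add(url)
        urls.append(url)
    return urls


pasted = """
https://www.glassdoor.com/Job/pittsburgh-software-engineer-jobs-SRCH_IL.0,10_IC1152990_KO11,28.htm

https://www.glassdoor.com/Job/washington-software-engineer-jobs-SRCH_IL.0,10_IC1138213_KO11,28.htm
https://www.glassdoor.com/Job/pittsburgh-software-engineer-jobs-SRCH_IL.0,10_IC1152990_KO11,28.htm
"""

url_list = parse_url_list(pasted)

# The loop visits each URL in turn; here we only print its path.
for index, url in enumerate(url_list, start=1):
    print(index, urlsplit(url).path)
```

The repeated Pittsburgh URL is dropped, so the loop body runs once per distinct region.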

 

 

Step 3. Create a list of items

Now, we can start training Octoparse to grab the data we want from each of the URLs on the list.

Once the webpage finishes loading, we can see that the job postings are arranged in similar sections on the left side of the page, so we first need to create a list of all the job postings on the page.

  • Hover over the first job section and click it

Note: If the section has not been identified properly, click "Expand the selection area" to expand, and the dashed box will show the section you have selected.

 

  • Then, click "Create a list of items" 
  • Click "Add current item to the list"

 

Now that the first item has been added to the list, click on another section to extract. Doing so trains Octoparse to identify all similar items on the page and include them in the extraction list.

  • Click "Continue to edit the list"
  • Click another section with similar layout
  • Click "Add current item to the list" 

All sections have now been added to the list.

  • Click "Finish Creating List"
  • Click "loop"
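
The idea of pointing at two sections and getting every similar one back can be sketched in Python as below. This is a simplified illustration under stated assumptions, not Octoparse's actual matching logic: the markup is invented, and "similar" is reduced to "same tag and same class attribute". The names `signature` and `similar_items` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page structure will differ.
PAGE = """
<ul>
  <li class="job"><a>Software Engineer</a><span>Acme</span></li>
  <li class="ad">Sponsored banner</li>
  <li class="job"><a>Senior Software Engineer</a><span>Initech</span></li>
  <li class="job"><a>Software Engineer II</a><span>Globex</span></li>
</ul>
""".strip()


def signature(element):
    """Describe an element by its tag and class attribute."""
    return (element.tag, element.get("class"))


def similar_items(root, example_a, example_b):
    """Return every element that shares the signature of both examples."""
    if signature(example_a) != signature(example_b):
        raise ValueError("the two examples do not look alike")
    wanted = signature(example_a)
    return [el for el in root.iter() if signature(el) == wanted]


root = ET.fromstring(PAGE)
children = list(root)                         # the four <li> elements
first, second = children[0], children[2]      # the two sections we "clicked"

job_items = similar_items(root, first, second)
print([item.find("a").text for item in job_items])
```

Two examples are enough here because both share the `("li", "job")` signature, while the advertisement row does not.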

 

 

Step 4. Define the data to capture

So we are done building the list. Now, let's go ahead and grab the data we want from each of the sections. 

Navigate to the "Extract Data" action and click it. Notice that the first section is outlined. Whenever we build a loop, the first item is generally selected automatically. If you would prefer to use a different item to define the extraction actions that follow, you can select it manually from the loop box.

  • Click on the job title, "Software Engineer" in this case
  • Select "Extract text" and notice that the data has been added to the customization pane
  • Follow the same steps to extract the other data fields
  • Rename "Field Name" if necessary
  • Click "Save"
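
As a rough picture of what this step configures, the sketch below applies the same set of field definitions to every item in the loop. It is an assumption-laden illustration: the markup, the class names, and the field names (`job_title`, `company`, `location`) are invented for the example and do not reflect Glassdoor's real HTML or Octoparse's internals.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for two job sections; not Glassdoor's real HTML.
ITEMS = """
<ul>
  <li class="job">
    <a class="title">Software Engineer</a>
    <span class="company">Acme</span>
    <span class="location">Pittsburgh, PA</span>
  </li>
  <li class="job">
    <a class="title">Senior Software Engineer</a>
    <span class="company">Initech</span>
  </li>
</ul>
""".strip()

# Field name -> ElementTree path relative to each job section (assumed).
FIELDS = {
    "job_title": "a[@class='title']",
    "company": "span[@class='company']",
    "location": "span[@class='location']",
}


def extract_text(item, path):
    """Return the stripped text at `path`, or "" when the element is missing."""
    node = item.find(path)
    if node is None or node.text is None:
        return ""
    return node.text.strip()


rows = []
for item in ET.fromstring(ITEMS).findall("li[@class='job']"):
    rows.append({name: extract_text(item, path) for name, path in FIELDS.items()})

for row in rows:
    print(row)
```

A missing element yields an empty string rather than an error, so one incomplete posting does not stop the loop.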

 

 

Step 5. Set up pagination 

The setup for extracting a single page is done. To make sure we get the data from all pages, we also need to set up pagination.

  • Locate the pagination button on the web page and click it
  • Choose "Loop Click Next Page"

This will tell Octoparse to paginate and scrape data from all pages.
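
Conceptually, a "next page" loop repeats the same extraction on each page until there is no next page left. The Python sketch below shows that control flow on a small in-memory stand-in for a website. The `PAGES` data, the `scrape_all_pages` function, and the `max_pages` safety limit are all hypothetical and only illustrate the idea.

```python
# Hypothetical in-memory "site": each page has some rows and an optional
# link to the next page. A real task would load these from the web instead.
PAGES = {
    "page-1": {"rows": ["job A", "job B"], "next": "page-2"},
    "page-2": {"rows": ["job C"], "next": "page-3"},
    "page-3": {"rows": ["job D", "job E"], "next": None},
}


def scrape_all_pages(start, load_page, max_pages=100):
    """Collect rows from `start` and every page reachable via "next"."""
    rows = []
    visited = set()
    current = start
    while current is not None and current not in visited:
        if len(visited) >= max_pages:
            break  # safety limit so a very long result set cannot run forever
        visited.add(current)
        page = load_page(current)
        rows.extend(page["rows"])   # same extraction on every page
        current = page["next"]      # "click" the next-page button
    return rows


all_rows = scrape_all_pages("page-1", PAGES.__getitem__)
print(all_rows)
```

The `visited` set stops the loop if a "next" link ever points back to a page that was already processed.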

 

 

Step 6. Run your task

We are done configuring the task! It's time to run the task to get the data we want.

  • Select "Local Extraction"
  • Click "OK" to run the task on your computer

Octoparse will automatically extract all the selected data. Check the "Data Extracted" pane for progress.

 

Octoparse offers two ways to run a task: Local Extraction and Cloud Extraction (premium plan). With Local Extraction, the task runs on your own device. With Cloud Extraction, the task runs on the Octoparse Cloud platform without using local resources, and the data is automatically extracted and saved to the cloud. Features such as scheduled extraction, IP rotation, and the API are also supported in the Cloud.

 

Step 7. Export the data

Click the "Export" button to export the extracted data to a file format of your choice (csv, xls, etc.) or to a database.
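
If you want to see what a CSV export of such rows looks like, or post-process the data yourself, the following Python sketch writes rows to CSV text with the standard `csv` module. The rows and field names are made-up sample data, and `rows_to_csv` is a hypothetical helper, not part of Octoparse.

```python
import csv
import io

# Hypothetical rows, shaped like the fields defined earlier in the task.
rows = [
    {"job_title": "Software Engineer", "company": "Acme", "location": "Pittsburgh, PA"},
    {"job_title": "Senior Software Engineer", "company": "Initech", "location": "Washington, DC"},
]


def rows_to_csv(rows, fieldnames):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows, ["job_title", "company", "location"])
print(csv_text)
```

Values that contain a comma, such as the location, are quoted automatically, so they stay in a single column when the file is opened in a spreadsheet.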

 

 

Done!

 

Good job completing the task!

 

Note: Since all actions in a task are interlinked, a tiny mistake or change can lead to a very different result, so please be patient and careful.

 

We are here to help: email support@octoparse.com or join our Facebook group, Octoparse Community (https://www.facebook.com/groups/1700643603550408/).

 


 

 

The Octoparse Team


 
