Scrape job info from Glassdoor

You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend you upgrade because it is faster, easier, and more robust! Download and upgrade here if you haven't already done so!

Glassdoor is one of the world's leading platforms for insights about jobs and companies, aimed at helping people find suitable employment.

This tutorial will show you how to scrape job information from glassdoor.com.

glassdoor.jpg

To follow through with this tutorial, you may want to use the URL below:

The main steps are shown in the menu on the right, and you can download the sample task file here.


1. Create a Go to Web Page - to open the target website

  • Enter the target URL into the search bar on the home screen and click Start


2. Auto-detect the webpage - to create a workflow

  • Click Auto-detect web page data on the Tips panel and wait for the detection to complete

auto_detect.jpg

  • Click Create workflow

create.jpg
  • Check the data fields in Data Preview and delete unwanted fields or rename them if needed

    • Delete unnecessary data fields directly by clicking More and Delete field

    • Modify the data field names by double-clicking the headers


3. Modify the XPath of the Loop Item and the data fields - to locate the fields accurately

The auto-generated XPath of some fields needs to be modified to make sure that Octoparse extracts accurate data.

  • Click Loop Item

  • Paste the updated XPath: //li[@data-adv-type="GENERAL"]

  • Click Apply

  • Click the More button next to the data field to change its settings

  • Choose Customize XPath

xpath.jpg
  • Input the Matching XPath

  • Click Apply to save the change

xpath_page.jpg

We have prepared the XPaths for the fields for you. You can copy and paste them to Octoparse.

  • Company: //div[contains(@id,'job-employer')]

  • Rating: //div[@class="job-search-8wag7x"]/span[2]
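To see what these XPaths select, here is a minimal Python sketch that applies the same selection logic outside Octoparse. The sample markup is hypothetical and heavily simplified (the real Glassdoor page will differ), and the standard-library ElementTree module is used only for illustration. ElementTree does not support contains(), so the Company match is approximated with a substring test in Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real Glassdoor page is more complex.
SAMPLE = """
<ul>
  <li data-adv-type="GENERAL">
    <div id="job-employer-101">Acme Corp</div>
    <div class="job-search-8wag7x"><span>Rating</span><span>4.2</span></div>
  </li>
  <li data-adv-type="SPONSORED">
    <div id="job-employer-102">Ad Corp</div>
  </li>
  <li data-adv-type="GENERAL">
    <div id="job-employer-103">Globex</div>
    <div class="job-search-8wag7x"><span>Rating</span><span>3.9</span></div>
  </li>
</ul>
"""

def extract_rows(markup):
    root = ET.fromstring(markup)
    rows = []
    # Loop Item: //li[@data-adv-type="GENERAL"]
    for item in root.findall(".//li[@data-adv-type='GENERAL']"):
        # Company: //div[contains(@id,'job-employer')]
        # ElementTree has no contains(), so the substring test is done in Python.
        company = next(
            (d.text for d in item.iter("div") if "job-employer" in d.get("id", "")),
            None,
        )
        # Rating: //div[@class="job-search-8wag7x"]/span[2]
        rating = item.find(".//div[@class='job-search-8wag7x']/span[2]")
        rows.append({
            "Company": company,
            "Rating": rating.text if rating is not None else None,
        })
    return rows

rows = extract_rows(SAMPLE)
```

In this sketch the list item whose data-adv-type is not GENERAL is skipped by the loop, which is the effect the updated Loop Item XPath is meant to have.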


4. Click on each link - to get detailed information

You may also need extra information about each job, such as responsibilities and requirements. To get it, set the task to click each link in the job list and open the detailed info.

  • Click on the first job title

  • Choose Click element on the Tips panel

  • Set an appropriate AJAX timeout (7-10 seconds recommended)

ajax.jpg

Note: If you are interested in how Octoparse handles AJAX websites, please check it out here.


6. Create an Extract Data - to add custom data fields for detailed job info

  • Click the "+" icon to add a step in the workflow

  • Click Extract Data

extract_data.jpg
  • Click Add Custom Field in the Data Preview

  • Click Capture data on the page

add_field.jpg
  • Input the field name as: Job_detail

  • Tick Absolute XPath and input the Matching XPath: //div[@class="jobDescriptionContent desc"]

  • Click Confirm to save the settings

data_field.jpg
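As a rough illustration of what the Job_detail XPath matches, the sketch below runs the same expression against a hypothetical detail page using Python's standard-library ElementTree. The markup is an assumption made up for this example. Note that @class="jobDescriptionContent desc" compares the whole attribute string, so an element whose class is only jobDescriptionContent is not matched.

```python
import xml.etree.ElementTree as ET

# Hypothetical detail-page markup, simplified for illustration.
DETAIL_PAGE = """
<html><body>
  <div class="jobDescriptionContent desc">
    <p>Responsibilities: build scrapers.</p>
    <p>Requirements: 2 years of Python.</p>
  </div>
  <div class="jobDescriptionContent">Unrelated block</div>
</body></html>
"""

def job_detail(markup):
    root = ET.fromstring(markup)
    # Searched from the page root, not relative to a list item.
    node = root.find(".//div[@class='jobDescriptionContent desc']")
    if node is None:
        return None
    # Join all nested text and collapse runs of whitespace.
    return " ".join(" ".join(node.itertext()).split())

detail = job_detail(DETAIL_PAGE)
```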

7. Clean the data field - to refine the data

You may notice that some values in the Company column have unwanted characters, such as digits and the ★ symbol, in front of the company name. Use Clean data to delete the unwanted text.

  • Click More > Clean data

  • Click Add Step > Replace with Regular Expression

  • Input the Regular Expression [0-9|★|.]{1,}

  • Click Confirm

  • Click Apply
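To preview what this regular expression removes, here is a small Python sketch using the re module. It assumes the matched text is replaced with an empty string, and the sample values are made up. Inside square brackets the | character is a literal pipe rather than "or", so the pattern matches runs of digits, pipes, ★, and dots.

```python
import re

# Same pattern as in the step above; "|" inside [] is a literal pipe.
PATTERN = re.compile(r"[0-9|★|.]{1,}")

def clean_company(raw):
    # Assumption: matches are replaced with nothing, then edges are trimmed.
    return PATTERN.sub("", raw).strip()

cleaned = [clean_company(v) for v in ["3.9★Globex", "Acme Corp 4.2 ★", "3M"]]
```

The last sample shows a side effect worth checking in your own data: company names that legitimately contain digits or dots lose those characters too.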

The final workflow looks like this:

workflow.jpg

8. Run the task - to get your desired data

  • Click Run to run your task either on your device or in the cloud

  • Select Standard Mode in the Run on your device section to run the task on your local device

  • Wait for the task to complete

Here is the sample output data, which can be exported in Excel, CSV, HTML and JSON formats.

glassdoor_data.jpg
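Once exported, the data can be post-processed with ordinary tools. The sketch below is one possible way to load a CSV export in Python. The column names Company, Rating, and Job_detail follow the fields used in this tutorial, but the rows are invented, and an in-memory string stands in for the exported file.

```python
import csv
import io

# Hypothetical export content; a real file would be opened with
# open(path, newline="", encoding="utf-8") instead of io.StringIO.
EXPORTED_CSV = """Company,Rating,Job_detail
Acme Corp,4.2,Build data pipelines.
Globex,,Maintain web services.
"""

def load_jobs(handle):
    jobs = []
    for row in csv.DictReader(handle):
        rating = row["Rating"].strip()
        # Jobs without a rating are kept, with Rating set to None.
        row["Rating"] = float(rating) if rating else None
        jobs.append(row)
    return jobs

jobs = load_jobs(io.StringIO(EXPORTED_CSV))
# Example filter: companies rated 4.0 or higher.
rated = [j["Company"] for j in jobs if j["Rating"] is not None and j["Rating"] >= 4.0]
```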