Scraping job postings from Glassdoor.com
Wednesday, May 17, 2017 10:43 AM
Welcome to Octoparse web scraping tutorial!
In this tutorial, we will show you how to scrape job postings from the recruitment website Glassdoor.com.
Features covered
- Build URL list
- Set up pagination
- Build a loop list
Now, let's get started!
Step 1. Set up basic information
- Click "Quick Start"
- Create a new task (Advanced Mode)
- Complete the basic information
Step 2. Create a loop for a list of URLs.
On Glassdoor.com, job postings are organized by region. To capture information from multiple regions, we first need to note the URLs for the regions we want data from, and then create an extraction task that runs through that list of URLs.
For this example, we'll use the URLs below:
https://www.glassdoor.com/Job/pittsburgh-software-engineer-jobs-SRCH_IL.0,10_IC1152990_KO11,28.htm
https://www.glassdoor.com/Job/washington-software-engineer-jobs-SRCH_IL.0,10_IC1138213_KO11,28.htm
https://www.glassdoor.com/Job/jobs.htm?suggestCount=0&suggestChosen=false&clickSource=searchBtn&typedKeyword=software+engineer&sc.keyword=software+engineer&locT=S&locId=2280&jobType=
- Drag a "Loop Item" into the Workflow Designer
- Choose "URL list" in the "Loop mode".
- Paste the list of URLs into the "URL list" box
- Click "Save"
- Drag a "Go to web page" action into the Loop action
- Check for "Open the URL in loop item"
- Adjust "Timeout" if necessary
Step 3. Create a list of items
Now, we can start training Octoparse to grab the data we want from each of the URLs on the list.
Once the webpage finishes loading, we can see that the job postings are arranged in similar sections on the left side of the page. So we know we'll first need to create a list of all the job postings on the page.
- Hover over the first job section and click
Note: If the section has not been identified properly, click "Expand the selection area" to expand the selection; the dashed box will show the section you have selected.
- Then, click "Create a list of items"
- Click "Add current item to the list"
Now that the first item has been added to the list successfully, click on another section to extract. By doing so, Octoparse will be trained to identify all similar items on the page and include them in the extraction list.
- Click "Continue to edit the list"
- Click another section with similar layout
- Click "Add current item to the list"
All sections are added to the list.
- Click "Finish Creating List"
- Click "loop"
Step 4. Define the data to capture
We are done building the list. Now, let's go ahead and grab the data we want from each of the sections.
Navigate to the "Extract Data" action and click it; notice that the first section is outlined. Whenever we build a loop, the first item is generally selected automatically. However, if you would prefer to define the extraction actions using an item other than the first, you can select the desired item from the loop box manually.
- Click on the job title, "Software Engineer" in this case
- Select "Extract text", notice the data has been added to the customization pane
- Follow the same steps to extract the other data fields
- Rename "Field Name" if necessary
- Click "Save"
Step 5. Set up pagination
We are done with the single-page extraction setup; however, to make sure we get the data from all pages, we'll need to set up pagination.
- Locate the pagination button on the web page and click it
- Choose "Loop Click Next Page"
This will tell Octoparse to paginate and scrape data from all pages.
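The underlying idea of "Loop Click Next Page" is to keep following the next-page control until there isn't one. Here is a hedged sketch of that loop in Python; the start URL is reused from the list above, the "li.next a" selector and page structure are assumptions for illustration only, and Octoparse drives a real browser and handles the clicking for you.

```python
# A hedged sketch of pagination: follow the "next page" link until it disappears.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://www.glassdoor.com/Job/pittsburgh-software-engineer-jobs-SRCH_IL.0,10_IC1152990_KO11,28.htm"
headers = {"User-Agent": "Mozilla/5.0"}

while url:
    page = requests.get(url, timeout=30, headers=headers)
    soup = BeautifulSoup(page.text, "html.parser")
    # ... extract the job sections from this page here ...
    next_link = soup.select_one("li.next a")  # hypothetical selector
    url = urljoin(url, next_link["href"]) if next_link else None
```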
Step 6. Run your task
We are done configuring the task! It's time to run the task to get the data we want.
- Select "Local Extraction"
- Click "OK" to run the task on your computer
Octoparse will automatically extract all the selected data. Check the "Data Extracted" pane for progress.
Octoparse offers both Local Extraction and Cloud Extraction (premium plan). With Local Extraction, the task is executed on your local device; with Cloud Extraction, the task is executed on the Octoparse Cloud platform without occupying local resources, and the data is automatically extracted and saved to the cloud. Features such as scheduled extraction, IP rotation, and API access are also supported with the Cloud. Find out more about Octoparse Cloud here.
Step 7. Export the data
Click "Export" button to export the extracted data to any formats (csv, xls, etc) or database.
Done!
Good job completing the task!
Note: Since all the actions in the workflow are interlinked, a tiny mistake or change can lead to a very different result, so please configure the task patiently and carefully.
We are here to help (support@octoparse.com), or join our Facebook group, Octoparse Community (https://www.facebook.com/groups/1700643603550408/).
You can check out similar case studies:
- Web Scraping Case Study | Security System News
- How to Extract Data from eBay
- How to extract data in the list on eBay
Or, learn more about related topics:
- Get Started with Octoparse in 2 Minutes
- Create A Loop For Pagination Manually
- Scrape ASPX Pages
- Web scraping | Introduction to Octoparse XPath Tool
- Modify XPath Manually in Octoparse
The Octoparse Team
For more information about Octoparse, please click here.