This tutorial is written for the latest version of Octoparse. If you are running an older version, we strongly recommend upgrading, because the latest version is faster, easier to use, and more robust. Download and upgrade here if you haven't already done so!
Glassdoor is one of the world's leading platforms for insights about jobs and companies, aimed at helping people find suitable employment.
This tutorial will show you how to scrape job information from glassdoor.com.
To follow along with this tutorial, you may want to use the URL below:
The main steps are shown in the menu on the right, and you can download the sample task file here.
1. Create a Go to Web Page - to open the target website
2. Auto-detect the webpage - to create a workflow
Click Auto-detect web page data on Tips and wait for the detection to complete
Click Create workflow
Check the data fields in Data Preview and delete unwanted fields or rename them if needed
Delete unnecessary data fields directly by clicking More and Delete field
3. Modify the XPath of the Loop Item and the data fields - to locate the fields accurately
The auto-generated XPath of some fields needs to be modified to make sure that Octoparse extracts accurate data.
Click the More button next to the data field to change its settings
Choose Customize XPath
Input the Matching XPath
Click Apply to save the change
We have prepared the XPaths for the fields for you. You can copy and paste them to Octoparse.
Company: //div[contains(@id,'job-employer')]
Rating: //div[@class="job-search-8wag7x"]/span[2]
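To see what these two expressions select, here is a minimal Python sketch that uses only the standard library. The HTML fragment, the company name, and the rating value are made-up placeholders, not real Glassdoor markup; only the two XPaths come from this tutorial. Because `xml.etree.ElementTree` supports just a subset of XPath and has no `contains()`, the Company expression is emulated with a plain substring check on the `id` attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup for one job card (not real Glassdoor HTML).
SAMPLE = """
<ul>
  <li>
    <div id="job-employer-101">Example Corp</div>
    <div class="job-search-8wag7x"><span>Rating</span><span>4.1</span></div>
  </li>
</ul>
"""

root = ET.fromstring(SAMPLE)

# Company: //div[contains(@id,'job-employer')]
# ElementTree has no contains(), so filter on the id attribute by hand.
companies = [
    (div.text or "").strip()
    for div in root.iter("div")
    if "job-employer" in div.get("id", "")
]

# Rating: //div[@class="job-search-8wag7x"]/span[2]
# span[2] is the second <span> child (XPath positions start at 1).
ratings = [
    (span.text or "").strip()
    for span in root.findall(".//div[@class='job-search-8wag7x']/span[2]")
]

print(companies)
print(ratings)
```

If a field comes back empty in Octoparse, it may help to check whether the class name or `id` prefix on the live page still matches the expression, since generated class names like the one above can change.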
4. Click on each link - to get detailed information
You may also need extra information about each job, such as its responsibilities and requirements. To get it, the next step is to click on each link in the job list and open the detail view.
Set an appropriate AJAX timeout (7-10 s is recommended)
6. Create an Extract Data - to add custom data fields for detailed job info
Click the "+" icon to add a step in the workflow
Click Extract Data
Click Add Custom Field in the Data Preview
Click Capture data on the page
Input the field name as: Job_detail
Tick Absolute XPath and input the Matching XPath: //div[@class="jobDescriptionContent desc"]
Click Confirm to save the settings
7. Clean the data field - to refine the data
You may notice that some values in the company column have unwanted text in front of them. Use Clean data to delete this text.
Click Add Step > Replace with Regular Expression
Input the Regular Expression [0-9|★|.]{1,}
Click Confirm
Click Apply
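To preview what this regular expression removes, here is a small Python sketch. It assumes every match is replaced with an empty string (the tutorial does not state the replacement value), and the input strings are made-up examples. `clean_company` is a hypothetical helper name used only for illustration.

```python
import re

# Pattern copied from the tutorial. Inside [...] the "|" is a literal pipe,
# not alternation, so the class matches digits, "|", "★", and ".".
PATTERN = re.compile(r"[0-9|★|.]{1,}")


def clean_company(raw: str) -> str:
    """Delete every match of PATTERN, then trim leftover whitespace."""
    return PATTERN.sub("", raw).strip()


print(clean_company("4.1 ★Example Corp"))
print(clean_company("Example Corp"))
```

Note that the pattern is not anchored to the start of the text, so digits or dots inside a company name are removed as well: under the same assumption, `clean_company("3M")` returns `"M"`. It is worth checking the cleaned column for names that contain numbers.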
The final workflow looks like this:
8. Run the task - to get your desired data
Click Run to run your task either on your device or in the cloud
Select Standard Mode under the Run on your device section to run the task on your local device
Wait for the task to complete
Here is the sample output data, which can be exported in Excel, CSV, HTML and JSON formats.
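Once exported, the data can be post-processed with any tool. As a minimal sketch, the Python snippet below parses CSV text and converts it to JSON. It assumes columns named Company, Rating, and Job_detail; the actual header names depend on how you named the fields in Data Preview, and the rows here are placeholders, not real scraped data.

```python
import csv
import io
import json

# Placeholder standing in for an exported CSV file; real exports will have
# whatever column names you gave the fields in Data Preview.
EXPORTED_CSV = """Company,Rating,Job_detail
Example Corp,4.1,Placeholder job description
Sample Ltd,,Another placeholder description
"""

rows = list(csv.DictReader(io.StringIO(EXPORTED_CSV)))

# Convert the rating to a number where present; keep None for missing values.
for row in rows:
    row["Rating"] = float(row["Rating"]) if row["Rating"] else None

print(json.dumps(rows, indent=2))
```

To read a real export, replace `io.StringIO(EXPORTED_CSV)` with an opened file, for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever you saved the export.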