Scraping job postings from Glassdoor.com
Wednesday, May 17, 2017 10:43 AM
Welcome to Octoparse web scraping tutorial!
In this tutorial, we will show you how to scrape job postings from the recruitment website Glassdoor.com.
Features covered
- Build URL list
- Set up pagination
- Build a loop list
Now, let's get started!
Step 1. Set up basic information
- Click "Quick Start"
- Create a new task (Advanced Mode)
- Complete the basic information
Step 2. Create a loop for a list of URLs.
On Glassdoor.com, job postings are organized by region. To capture information from multiple regions, we first need to note the URLs for the regions we want data from, and then create an extraction task that runs through that list of URLs.
For this example, we'll use the URLs below:
https://www.glassdoor.com/Job/pittsburgh-software-engineer-jobs-SRCH_IL.0,10_IC1152990_KO11,28.htm
https://www.glassdoor.com/Job/washington-software-engineer-jobs-SRCH_IL.0,10_IC1138213_KO11,28.htm
https://www.glassdoor.com/Job/jobs.htm?suggestCount=0&suggestChosen=false&clickSource=searchBtn&typedKeyword=software+engineer&sc.keyword=software+engineer&locT=S&locId=2280&jobType=
- Drag a "Loop Item" into the Workflow Designer
- Choose "URL list" in the "Loop mode".
- Paste the list of URLs into the "URL list" box
- Click "Save"
- Drag a "Go to web page" action into the Loop action
- Check for "Open the URL in loop item"
- Adjust "Timeout" if necessary
Step 3. Create a list of items
Now, we can start training Octoparse to grab the data we want from each of the URLs on the list.
Once the webpage finishes loading, we can see that the job postings are arranged in similar sections on the left side of the page. So we know we'll first need to create a list of all the job postings on the page.
- Hover over the first job section and click
Note: If the section has not been identified properly, click "Expand the selection area" to expand the selection; the dashed box will show the section you have selected.
- Then, click "Create a list of items"
- Click "Add current item to the list"
Now that the first item has been added to the list successfully, click on another section to extract. By doing so, Octoparse will be trained to identify all similar items on the page and include them in the extraction list.
- Click "Continue to edit the list"
- Click another section with similar layout
- Click "Add current item to the list"
All sections are added to the list.
- Click "Finish Creating List"
- Click "loop"
Step 4. Define the data to capture
We are done building the list. Now, let's go ahead and grab the data we want from each of the sections.
Navigate to the "Extract Data" action and click it; notice that the first section is outlined. Whenever we build a loop, the first item is generally selected automatically. However, if you would prefer to define the extraction actions using an item other than the first, you can select the desired item from the loop box manually.
- Click on the job title, "Software Engineer" in this case
- Select "Extract text", notice the data has been added to the customization pane
- Follow the same steps to extract the other data fields
- Rename "Field Name" if necessary
- Click "Save"
Step 5. Set up pagination
We are done with the single-page extraction setup; however, to make sure we get the data from all pages, we'll need to set up pagination.
- Locate the pagination button on the web page and click it
- Choose "Loop Click Next Page"
This will tell Octoparse to paginate and scrape data from all pages.
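The underlying idea of "Loop Click Next Page" is to keep following the next-page control until there isn't one. Here is a hedged sketch of that loop in Python; the start URL is reused from the list above, the "li.next a" selector and page structure are assumptions for illustration only, and Octoparse drives a real browser and handles the clicking for you.

```python
# A hedged sketch of pagination: follow the "next page" link until it disappears.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://www.glassdoor.com/Job/pittsburgh-software-engineer-jobs-SRCH_IL.0,10_IC1152990_KO11,28.htm"
headers = {"User-Agent": "Mozilla/5.0"}

while url:
    page = requests.get(url, timeout=30, headers=headers)
    soup = BeautifulSoup(page.text, "html.parser")
    # ... extract the job sections from this page here ...
    next_link = soup.select_one("li.next a")  # hypothetical selector
    url = urljoin(url, next_link["href"]) if next_link else None
```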
Step 6. Run your task
We are done configuring the task! It's time to run the task to get the data we want.
- Select "Local Extraction"
- Click "OK" to run the task on your computer
Octoparse will automatically extract all the selected data. Check the "Data Extracted" pane for progress.
Octoparse offers both Local Extraction and Cloud Extraction (premium plan). With Local Extraction, the task is executed on your local device; with Cloud Extraction, the task is executed on the Octoparse Cloud platform without occupying local resources, and the data is automatically extracted and saved to the cloud. Features such as scheduled extraction, IP rotation, and API access are also supported with the Cloud. Find out more about Octoparse Cloud here.
Step 7. Export the data
Click "Export" button to export the extracted data to any formats (csv, xls, etc) or database.
Done!
Good job completing the task!
Note: Since all the actions in the workflow are interlinked, a tiny mistake or change can lead to a very different result, so please configure the task patiently and carefully.
We are here to help (support@octoparse.com), or join our Facebook group, Octoparse Community (https://www.facebook.com/groups/1700643603550408/).
You can check out similar case studies:
- Web Scraping Case Study | Security System News
- How to Extract Data from eBay
- How to extract data in the list on eBay
Or, learn more about related topics:
- Get Started with Octoparse in 2 Minutes
- Create A Loop For Pagination Manually
- Scrape ASPX Pages
- Web scraping | Introduction to Octoparse XPath Tool
- Modify XPath Manually in Octoparse
The Octoparse Team
For more information about Octoparse, please click here.