Scrape job data from Glassdoor

Saturday, October 27, 2018

In this tutorial, we will show how to scrape job information from glassdoor.com.

To follow along, you can use the URL from this tutorial:

https://www.glassdoor.com/Job/jobs.htm?suggestCount=0&suggestChosen=true&clickSource=searchBtn&typedKeyword=marketing&sc.keyword=Marketing+Manager&locT=N&locId=1&jobType
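Before pasting the URL into Octoparse, it can help to see which search it encodes. The Python sketch below is not part of the Octoparse workflow; it only splits the URL above into its query parameters with the standard library. The function name `search_params` is made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# The tutorial URL, copied as-is and split across lines for readability.
URL = (
    "https://www.glassdoor.com/Job/jobs.htm?suggestCount=0&suggestChosen=true"
    "&clickSource=searchBtn&typedKeyword=marketing&sc.keyword=Marketing+Manager"
    "&locT=N&locId=1&jobType"
)


def search_params(url):
    """Return the query parameters of a URL as a flat dict of strings."""
    query = urlsplit(url).query
    # keep_blank_values=True keeps keys that carry no value, such as "jobType".
    parsed = parse_qs(query, keep_blank_values=True)
    return {key: values[0] for key, values in parsed.items()}


params = search_params(URL)
for key, value in params.items():
    print(f"{key} = {value!r}")
```

The `sc.keyword` parameter holds the search term ("Marketing Manager"), so editing that value is one way to try the same task on a different search.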

We will click each listing link and scrape the company name, type, address, and other related information.

 

Here are the main steps in this tutorial. [Download demo task file here]

1) "Go To Web Page" - to open the targeted web page

2) Create a pagination loop - to scrape all the results from multiple pages

3) Create a "Loop Item" - to click into each listing on each page

4) Extract data - to select the data for extraction

5) Save and start extraction - to run the task and get data

1) "Go To Web Page" - to open the targeted web page

· Click "+ Task" to start a task using Advanced Mode

Advanced Mode is a highly flexible and powerful web scraping mode.

· Paste the URL into the "Extraction URL" box and click "Save URL" to move on

The first results page now opens in Octoparse.

2) Create a pagination loop - to scrape all the results from multiple pages

· Scroll down and click the "Next Page" button

· Click "Loop click next page" on the "Action Tips"

 

Because Glassdoor loads the next page of results with AJAX when the "Next Page" button is clicked, we need to set up AJAX Load for the "Pagination" action. Otherwise, Octoparse may get stuck at this step.

· Uncheck "Auto retry when no response"

· Check "Load the page with AJAX"

· Set up "AJAX Timeout"
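To see why a timeout matters here, consider the conceptual Python sketch below. It does not describe Octoparse's internals; it is only a model, under the assumption that an AJAX "Next Page" click updates the results in place without a full page reload, so the only way to know when to continue is to wait a bounded amount of time. `FakePage`, `wait_for_change`, and `paginate` are hypothetical names, and the fake page updates instantly, which a real site would not.

```python
import time


class FakePage:
    """Stand-in for a results page that changes in place when 'next' is clicked."""

    def __init__(self, pages):
        self._pages = pages
        self._index = 0

    def current(self):
        return self._pages[self._index]

    def has_next(self):
        return self._index + 1 < len(self._pages)

    def click_next(self):
        self._index += 1


def wait_for_change(read_state, old_state, timeout_s, poll_s=0.05):
    """Poll until read_state() differs from old_state or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if read_state() != old_state:
            return True
        time.sleep(poll_s)
    # One last check so a change right at the deadline is not missed.
    return read_state() != old_state


def paginate(page, timeout_s):
    """Yield each results page; stop when there is no next page or nothing changes."""
    while True:
        yield page.current()
        if not page.has_next():
            break
        before = page.current()
        page.click_next()
        # Without a timeout, a click that never updates the page would block forever.
        if not wait_for_change(page.current, before, timeout_s):
            break


for content in paginate(FakePage(["page 1", "page 2", "page 3"]), timeout_s=0.2):
    print(content)
```

In this model, a timeout that is too short cuts the wait off before the new results arrive, while a very long one only slows the run down.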

 

 

 

 

3) Create a "Loop Item" - to click into each listing on each page

Notice that Octoparse is now showing the second page of results. When creating a "Loop Item", we should always start with the first item on the first page, so we need to go back to the first page.

· Click "Go To Web Page" in the workflow.

· Select the pagination loop

Doing this tells Octoparse the execution order, so that the Loop Item is generated at the appropriate position in the workflow.

· Click the titles of the listings

· Select "Loop click each element"

Octoparse will click through each listing.
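The resulting workflow is a loop inside a loop: the pagination loop on the outside and the Loop Item on the inside. The Python sketch below illustrates that execution order only; it is not Octoparse code, and `run_workflow` and the sample listing names are placeholders.

```python
def run_workflow(result_pages):
    """Visit every listing on every results page, in workflow order."""
    visited = []
    # Outer loop: the pagination loop, one pass per results page.
    for page_number, listings in enumerate(result_pages, start=1):
        # Inner loop: the Loop Item, one click per listing on the current page.
        for title in listings:
            visited.append((page_number, title))
    return visited


# Placeholder data: two results pages with made-up listing titles.
sample_pages = [["Listing A", "Listing B"], ["Listing C"]]
for page_number, title in run_workflow(sample_pages):
    print(page_number, title)
```

If the Loop Item sat outside the pagination loop instead, only the listings on one page would be visited, which is why the position in the workflow matters.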

 

The website also uses AJAX to load the company details, so we need to set up AJAX Load for the "Click Item" action.

· Uncheck "Auto retry when no response"

· Check "Load the page with AJAX"

· Set up "AJAX Timeout"

 

 

 

 

4) Extract data - to select the data for extraction

The tabs also use AJAX to load their content.

In this tutorial, we are going to extract the data under the "Company" and "Rating" tabs.

· Click the "Company" tab

· Select "Click element" on the "Action Tips"

· Set up "AJAX Load" for the "Click Item" action

 

Now, let’s extract the data under the "Company" tab.

· Select the data you want

· Click "extract data" on the "Action Tips"

 

 

Repeat the steps above to open the "Rating" tab.

· Click the "Rating" tab

· Select "Click element" on the "Action Tips"

· Set up "AJAX Load"

 

Now we can extract the data under the "Rating" tab.

· Select the data you want

· Click "extract data" on the "Action Tips"
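Because both tabs belong to the same listing, the fields selected under "Company" and "Rating" end up in one row per listing. The Python sketch below shows one way to picture that row; it is an illustration only, and the function `build_row`, the field names, and the placeholder values are assumptions rather than Octoparse's actual output columns.

```python
def build_row(job_title, tabs):
    """Merge the fields extracted under each tab into one flat row."""
    row = {"job_title": job_title}
    for tab_name, fields in tabs.items():
        prefix = tab_name.lower()
        for field_name, value in fields.items():
            # Prefix each field with its tab so names from different tabs cannot clash.
            row[f"{prefix}_{field_name}"] = value
    return row


# Placeholder values only; real values come from the page being scraped.
example = build_row(
    "Marketing Manager",
    {
        "Company": {"type": "PLACEHOLDER type", "address": "PLACEHOLDER address"},
        "Rating": {"overall": "PLACEHOLDER rating"},
    },
)
print(example)
```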

 

 

 

 

 

5) Save and start extraction - to run the task and get data

· Click "Start Extraction"

· Select "Local Extraction" to run the task on your computer

 

 

Here is the sample output.

 
