Step-by-step tutorials for you to get started with web scraping


Scrape job information from Indeed

Thursday, August 16, 2018

In this tutorial, we will show you how to build a web scraping task with Octoparse to collect job posting information from Indeed.

By configuring the task in the app, you can scrape job information from Indeed for purposes such as data science or recruitment. Once the configuration is done, the entire scraping process runs automatically, with no coding needed.

Let's see how it's done!

 

Before we start, we need to obtain the URL of the target results page by searching Indeed for the keyword "DevOps" and the location "Dallas-Fort Worth, TX".

The search gives us the URL of the page we will scrape:

https://www.indeed.com/jobs?q=devops&l=Dallas-Fort%20Worth%2C%20TX&radius=50  
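If you would rather build the search URL programmatically than copy it from the browser, here is a minimal Python sketch. It assumes only the query parameter names (q, l, radius) that appear in the URL above; build_search_url is a hypothetical helper for illustration and is not part of Octoparse or Indeed.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.indeed.com/jobs"


def build_search_url(keyword: str, location: str, radius: int = 50) -> str:
    """Hypothetical helper: assemble an Indeed search URL from its parts."""
    params = {"q": keyword, "l": location, "radius": radius}
    # quote_via=quote encodes spaces as %20 (not "+"), matching the URL above.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


url = build_search_url("devops", "Dallas-Fort Worth, TX")
print(url)
```

Changing the keyword or location arguments gives you the starting URL for a different search, which you can paste into Octoparse in the same way.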

 

We will scrape job titles and descriptions in this tutorial.

Here are the main steps covered: [Download example task file]

1) "Go To Web Page" - to open the target website

2) Create a pagination - to extract multiple web pages

3) Modify XPath - to paginate correctly

4) Extract data - to select data from your target website

5) Run your task - to get data you want

 

 

 

1) "Go To Web Page" - to open the target website

      · Create your task with "Advanced mode".

      · Paste the URL we just obtained into the "Extraction URL" box and save it to move on.

 

Tips!

· We always suggest turning on "Workflow" to get a better picture of what the task is doing.

 

 

 

 

2) Create a pagination - to extract multiple web pages

      · Scroll down to find the "Next" button. Clicking the button does not automatically select its "A" tag, so we need to select the "A" tag ourselves and then click "Loop click the selected link".
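To picture what a pagination loop does, here is a minimal Python sketch of the idea: visit a page, follow its "Next" link, and stop when there is none. This is not how Octoparse is implemented; loop_click, NEXT_LINKS and the page names are hypothetical and stand in for real pages.

```python
from typing import Callable, Iterator, Optional


def loop_click(
    start: str,
    next_of: Callable[[str], Optional[str]],
    max_pages: int = 100,
) -> Iterator[str]:
    """Yield each page in turn, following "Next" until it is missing."""
    seen = set()
    page: Optional[str] = start
    # Stop when there is no next page, on a repeated page, or at max_pages.
    while page is not None and page not in seen and len(seen) < max_pages:
        seen.add(page)
        yield page
        page = next_of(page)


# Hypothetical three-page result set; the last page has no "Next" link.
NEXT_LINKS = {"page1": "page2", "page2": "page3", "page3": None}

visited = list(loop_click("page1", NEXT_LINKS.get))
print(visited)
```

The max_pages limit and the check for repeated pages are there so that the loop always ends, even if a site's last page links back to an earlier one.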

 

 

 

 

3) Modify XPath - to paginate correctly

XPath is a language that lets you locate specific elements on a page precisely, based on their tags and attributes. Before you write your own XPath, you need to inspect the HTML structure of the page.

      · Find the correct XPath with the Firepath/Firebug extension in the Firefox browser. The correct XPath is //span[contains(text(),'Next')][@class="np"]/../..

      · Click the pagination loop in your workflow and paste the correct XPath into the "Single element" box under "Advanced Options".
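To see what this XPath selects, here is a minimal Python sketch that performs the same lookup on a small sample. The markup in SAMPLE is an assumption made for illustration and is not copied from Indeed. Python's standard xml.etree.ElementTree supports only a subset of XPath (it has no contains() function), so the sketch finds span elements with class "np", checks their text for "Next" in Python, and then climbs two levels, which is what the trailing /../.. does.

```python
import xml.etree.ElementTree as ET

# Assumed pagination markup, for illustration only.
SAMPLE = """
<div class="pagination">
  <a href="/jobs?q=devops&amp;start=0"><span class="pn">1</span></a>
  <a href="/jobs?q=devops&amp;start=20"><span class="pn"><span class="np">Next</span></span></a>
</div>
"""


def find_next_link(root):
    """Return the element two levels above the span.np whose text contains 'Next'."""
    # ElementTree elements do not store their parent, so build a lookup table.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for span in root.findall(".//span[@class='np']"):
        if "Next" in (span.text or ""):
            parent = parent_of.get(span)  # first ".."
            return parent_of.get(parent)  # second ".."
    return None


link = find_next_link(ET.fromstring(SAMPLE.strip()))
print(link.tag, link.get("href"))
```

In this sample the span sits two levels below the link, which is why the XPath ends in /../.. to land on the "A" tag rather than on the span itself.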

 

Tips!

      · The Firebug extension is very useful for inspecting the elements of an HTML document. (Firebug is now only available for old versions of Firefox.)

      · An XPath written by hand in Octoparse is more flexible and accurate than the XPath auto-generated by clicking elements during task configuration. If you cannot extract data from the next page, check "Single element" under "Loop mode" and enter your own XPath.

      · If you are new to XPath, it is worth working through an XPath tutorial before continuing.

 

 

 

4) Extract data - to select data from your target website

      · Select the first job title, then click "Select all" and "Extract link text" in the "Action Tips" panel.

      · Paste the correct XPath, .//td[@id='resultsCol']/div[contains(@class,'row')], into the "Variable list" box under "Advanced Options" and click "OK" to save.

      · Extract any other data you want and rename the fields if necessary.
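The "Variable list" XPath picks out one block per job posting, and the fields are then read from inside each block. Here is a minimal Python sketch of that idea. The markup in SAMPLE, the "summary" class and the field names are assumptions made for illustration. Because the standard xml.etree.ElementTree has no contains() function, the class check is done in Python with a substring test, which behaves the same way as contains(@class,'row').

```python
import xml.etree.ElementTree as ET

# Assumed results markup, for illustration only.
SAMPLE = """
<table><tr><td id="resultsCol">
  <div class="row result">
    <h2><a href="/job/1">DevOps Engineer</a></h2>
    <span class="summary">Build CI pipelines.</span>
  </div>
  <div class="ad">Sponsored banner</div>
  <div class="row result">
    <h2><a href="/job/2">Site Reliability Engineer</a></h2>
    <span class="summary">Run production systems.</span>
  </div>
</td></tr></table>
"""


def extract_jobs(root):
    """Collect title and description from each div whose class contains 'row'."""
    jobs = []
    for div in root.findall(".//td[@id='resultsCol']/div"):
        if "row" not in div.get("class", ""):
            continue  # skip blocks that are not job postings
        link = div.find(".//a")
        summary = div.find(".//span[@class='summary']")
        jobs.append({
            "title": "".join(link.itertext()).strip() if link is not None else "",
            "description": "".join(summary.itertext()).strip() if summary is not None else "",
        })
    return jobs


jobs = extract_jobs(ET.fromstring(SAMPLE.strip()))
for job in jobs:
    print(job["title"], "|", job["description"])
```

The div with class "ad" is skipped because its class does not contain "row", so only the two job postings end up in the result.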

 

 

 

 

5) Run your task - to get data you want

      · Click "Start Extraction" and then select "Local Extraction".  

 

  

 

Related Articles:

XPath Introduction in Wikipedia 

Locate elements with XPath 

Extract multiple pages through pagination 

Example: Scrape business information from Yelp 

 
