Why job data?
Over years of working in the web scraping industry and talking to users from all over the world, I've found that job data stands out as one of the most sought-after kinds of information on the web. I was honestly a bit overwhelmed until I came across Gallup's 2017 State of the American Workplace report, which stated that 51% of currently employed adults are searching for new jobs or looking for new work opportunities, and that 58% of job seekers look for jobs online. In other words, this market is huge. At the same time, I was surprised to find out how many ways there are to utilize job data, just to name a few:
- Fueling job aggregator sites with fresh job data.
- Collecting data for analyzing job trends and the labor market.
- Tracking competitors' open positions, compensation, and benefits to get a leg up on the competition.
- Finding leads by pitching your service to companies that are hiring.
- Keeping staffing agencies' job databases up-to-date.
Challenges of scraping job postings
First and foremost, you'll need to decide where to extract this information. There are two main types of sources for job data:
- Major job aggregator sites such as Indeed, Monster, Naukri, ZipRecruiter, Glassdoor, Craigslist, LinkedIn, SimplyHired, reed.co.uk, Jobster, Dice, and Facebook Jobs.
- Every company, big or small, has a career section on its website. Scraping those pages regularly can give you the most up-to-date list of job openings.
[Further reading: 70 Amazing Free Data Sources You Should Know]
Next, you'll need a web scraper for whichever websites you choose. Large job portals can be extremely tricky to scrape because they almost always implement anti-scraping techniques to prevent bots from collecting their information. Some of the more common blocks include IP bans, tracking of suspicious browsing activity, honeypot traps, and Captchas that stop excessive page visits. If you are interested, this article provides good insights into bypassing some of the most common anti-scraping blocks. By contrast, company career sections are usually easier to scrape. Yet, because each company has its own web interface, a separate crawler needs to be set up for each one. Not only is the upfront cost high, but the crawlers are also challenging to maintain, since websites change their layouts quite often.
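The most basic defense against the blocks above is simply being polite: pacing your requests and varying the browser fingerprint you present. Here's a minimal sketch of that idea in Python (the User-Agent strings and delay range are illustrative assumptions, not values any particular site requires):

```python
import random
import time

# Hypothetical pool of User-Agent strings to rotate through, so requests
# don't all present an identical fingerprint.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def polite_headers():
    """Pick a random User-Agent for each request."""
    return {"User-Agent": random.choice(USER_AGENTS)}

def throttled_fetch(fetch, urls, min_delay=2.0, max_delay=5.0):
    """Call fetch(url, headers=...) for each URL, pausing a random
    interval between requests so the target site isn't hammered."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url, headers=polite_headers()))
    return results
```

The `fetch` callable is a stand-in for whatever HTTP client you use; the point is that throttling and header rotation sit in one place, separate from the parsing logic.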
What are the options for scraping job data?
There are a few options for how you can scrape job listings from the web.
1. Hiring a web scraping service (DaaS)
These companies provide what is generally known as a "managed service". Some well-known web scraping vendors are Scrapinghub, Datahen, and Data Hero. They will take in your requests and set up whatever is needed to get the job done, such as the scripts, the servers, and the IP proxies. Data will be delivered to you in the format and at the frequency required. Scraping services usually charge based on the number of websites, the amount of data to fetch, and the frequency of the crawls. Some companies charge extra for the number of data fields and for data storage. Website complexity is, of course, a major factor that can affect the final price. For every website, there is usually a one-off setup fee and a monthly maintenance fee.
Pros:
- No learning curve. Data is delivered to you directly.
- Highly customizable and tailored to your needs.
Cons:
- The cost can be high, especially if you have many websites to scrape ($350 ~ $2500 per project, plus a $60 ~ $500 monthly maintenance fee).
- Long-term maintenance costs can cause the budget to spiral out of control.
- Extended development time, as each website needs to be set up in its entirety (3 to 10 business days per site).
2. In-house web scraping setup
Doing web scraping in-house with your own tech team and resources comes with its perks and downfalls.
Pros:
- Complete control over the crawling process.
- Fewer communication challenges, faster turnaround.
Cons:
- High cost. A dedicated tech team costs a lot (as much as 20x more, from what I've heard).
- Less expertise. Web scraping is a niche process that requires a high level of technical skill, especially if you need to scrape some of the more popular websites or extract a large amount of data on a regular basis. Starting from scratch is tough even if you hire professionals, whereas data service providers, as well as scraping tools, tend to be more experienced at tackling unanticipated obstacles.
- Loss of focus. Why not spend more time and energy on growing your business?
- Infrastructure requirements. Owning the crawling process also means you'll have to get servers for running the scripts, data storage, and data transfer. There's also a good chance you'll need a proxy service provider and a third-party Captcha solver. Getting all of these in place and maintaining them on a daily basis can be extremely tiring and inefficient.
- Maintenance headaches. Scripts need to be updated or even rewritten constantly, as they break whenever websites update their layouts or code.
- Legal risks. Web scraping is legal in most cases, though there is still much debate and the law has not explicitly settled on one side or the other. Generally speaking, public information is safe to scrape, and if you want to be more cautious, check the website's TOS (terms of service) and avoid infringing it. That said, should this become a concern, hiring another company or person to do the job will reduce the level of risk involved.
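To give a sense of the infrastructure point above: one common building block is spreading requests across a pool of proxies. A minimal round-robin sketch (the proxy addresses are placeholders, not real endpoints):

```python
import itertools

# Hypothetical proxy pool; a real setup would get these from a paid
# proxy provider rather than hard-coding them.
PROXIES = ["http://proxy-a:8080", "http://proxy-b:8080", "http://proxy-c:8080"]

def proxy_cycle(proxies):
    """Yield proxies round-robin so requests spread evenly across the pool."""
    return itertools.cycle(proxies)

rotation = proxy_cycle(PROXIES)
first_three = [next(rotation) for _ in range(3)]  # each proxy used once
fourth = next(rotation)                           # wraps back to the first proxy
```

In practice you would also track failures per proxy and drop dead ones from the pool, which is exactly the kind of day-to-day upkeep the bullet above refers to.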
3. Using a web scraping tool
Technology has been advancing, and just like anything else, web scraping can now be automated. Many web scraping tools are designed for non-technical people to fetch data from the web. These so-called web scrapers or web extractors traverse the website and capture the designated data by deciphering the HTML structure of the webpage. You get to "tell" the scraper what you need through drags and clicks. The program learns what you need through its built-in algorithm and performs the scraping automatically. Most scraping tools can be scheduled for regular extraction and can be integrated into your own system.
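Under the hood, "deciphering the HTML structure" means mapping the elements you click to positions in the HTML tree, then pulling the matching text from every page. A toy stdlib-only illustration of that idea (the class names and markup are made up for the example; real job sites use their own markup):

```python
from html.parser import HTMLParser

# Simplified stand-in for a job listings page.
SAMPLE = """
<div class="job"><h2 class="title">Data Scientist</h2>
<span class="company">Acme Corp</span></div>
<div class="job"><h2 class="title">ML Engineer</h2>
<span class="company">Globex</span></div>
"""

class FieldExtractor(HTMLParser):
    """Collect the text of every element whose class attribute matches target."""
    def __init__(self, target):
        super().__init__()
        self.target = target
        self.capture = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("class") == self.target:
            self.capture = True

    def handle_endtag(self, tag):
        self.capture = False

    def handle_data(self, data):
        if self.capture and data.strip():
            self.values.append(data.strip())

def extract(html, target):
    parser = FieldExtractor(target)
    parser.feed(html)
    return parser.values
```

A visual scraper builds a similar (much more robust) rule automatically from your clicks, instead of making you write the selector yourself.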
[Further reading: Top 30 Free Web Scraping Software]
Pros:
- Budget-friendly. Most web scraping tools support monthly payments ($60 ~ $200 per month) and some even offer free plans that are quite robust (such as the one I use).
- Non-coder friendly. Most of them are relatively easy to use and can be handled by people with little or no technical knowledge. If you want to save time, some vendors offer crawler setup services as well as training sessions.
- Scalable. Easily supports projects of all sizes, from one website to thousands. Scale up as you go.
- Fast turnaround. Depending on your effort, a crawler can be built in 10 minutes.
- Complete control. Once you've learned the process, you can set up more crawlers or modify existing ones without seeking help from a tech team or service provider.
- Low maintenance cost. As you won't need a tech team to fix the crawlers, you can easily keep the maintenance cost in check.
Cons:
- Learning curve. Depending on the product you choose, it can take some time to learn the process. Visual scrapers such as import.io, dexi.io, and Octoparse are easier to pick up.
- Compatibility. All web scraping tools claim to cover sites of all kinds, but the truth is, there will never be 100% compatibility when you try to apply one tool to literally millions of websites.
- Captcha. Most web scraping tools out there cannot solve Captchas.
A real web scraping example...
To make this post more useful to you, I've decided to give you a little tutorial on how to scrape Indeed using my favorite scraping tool of all time, Octoparse. In this example, I will scrape some basic information for data scientist positions in New York City.
Data to extract
- Job title
- Job location
- Employer name
- Job Description
- Number of reviews
- Page URL
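The fields above can be expressed as a simple record type, so scraped rows are validated and exported uniformly regardless of which tool produced them (the field and class names here are my own, and the row values are fabricated for illustration):

```python
from dataclasses import dataclass, asdict

@dataclass
class JobPosting:
    """One scraped job listing; mirrors the field list above."""
    title: str
    location: str
    employer: str
    description: str
    review_count: int
    page_url: str

# Example row with made-up values, converted to a plain dict for export.
row = JobPosting(
    title="Data Scientist",
    location="New York, NY",
    employer="Acme Corp",
    description="Build and deploy models.",
    review_count=128,
    page_url="https://www.indeed.com/viewjob?jk=example",
)
record = asdict(row)
```

Defining the schema up front also makes it obvious when a site change silently drops a field from your extraction.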
Creating a scraping project
1. Launch Octoparse and create a new project by clicking on "+Task" under Advanced Mode.
2. Enter the target URL (https://www.indeed.com/jobs?q=Data%20Scientist&l=New%20York%20State&_ga=2.92303069.138961637.1571107168-1638621315.1571107168) into the URL box. This is the URL copied from Chrome upon searching for "data scientists" near "New York" on Indeed.com. Click "Save URL" to proceed.
Since I am using a 17" monitor, I always like to switch to full-screen mode by toggling the workflow button at the top. This gives me a better view of the webpage.
3. Click on the first job title. Then, click on the second job title (any other job title will do).
4. Follow the instructions provided on "Action Tips", which now reads "10 elements selected". I obviously want to click open each one of the selected titles, so it makes sense to select "Loop click each element".
Whenever you have successfully built a list to loop through, a loop will be created and added to the workflow. Switch back to the workflow mode and see if this is the case for you.
5. Now that I am on the job page, I am going to extract the data I need by clicking on it. Click on the title of the job, the location, the number of reviews, the company name, and the job description.
6. Once done selecting the fields needed, click on "Extract data" on the "Action Tips".
7. Next, I am going to capture the Page URL by adding a pre-defined field.
- Access the task workflow by toggling the workflow button on the top.
- With the "Extract data" step of the workflow selected, click on "Add pre-defined field"
- Select "Add current page information", then "Web page URL". This will get the page URL fetched along with all the other data fields.
Octoparse will automatically generate field names for the captured data fields. If you need to rename a field, simply type over the current name.
8. So far I've managed to extract all the jobs listed on the first page, but I'll definitely want to extract more pages. To do this, I'll set up pagination, i.e., have Octoparse crawl through the different pages.
- Return to the search result page by clicking on the loop item of the workflow.
- Scroll down the page and find the "Next" button, click on it.
- Select "Loop click single element" on "Action Tips". Octoparse will click the "Next" button until it reaches the last page (when "Next" is no longer found on the page).
You can also specify the number of pages to extract. For example, if you want to extract only the first 3 pages, enter "2" for "End loop when execution times reach X". This way, Octoparse will paginate only twice and stop when it reaches page 3.
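The pagination logic described here can be sketched in plain Python; the `get_page` function and the page dictionaries below are stand-ins I've made up to show the control flow, not Octoparse's actual internals:

```python
def paginate(get_page, max_clicks=None):
    """Collect pages by "clicking Next" until it is absent or the click cap is hit.

    get_page(n) is a stand-in for fetching and parsing results page n;
    it returns a dict with a "has_next" flag.
    """
    pages, clicks = [], 0
    page = get_page(1)
    pages.append(page)
    while page.get("has_next") and (max_clicks is None or clicks < max_clicks):
        clicks += 1
        page = get_page(clicks + 1)
        pages.append(page)
    return pages
```

With a cap of 2 clicks you end up with 3 pages of results, matching the "enter 2 to get 3 pages" behavior above.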
9. As soon as I reach page 2, I notice that the "Next" element is no longer detected correctly: the auto-generated XPath now tracks the "Previous" button instead. To solve this, I'll modify the XPath manually.
- With the pagination loop selected, change the XPath of the single element to //SPAN[contains(text(), 'Next')].
- Now we have the correct "Next" button detected.
Learn how to modify the XPath when the auto-generated one fails.
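Why does the text-anchored XPath fix this? On page 1, "Next" is the only pagination link, so a positional rule ("take the first element of this kind") happens to work; on page 2, a "Previous" link appears first and the positional rule grabs the wrong button. Matching on the text "Next", as `//SPAN[contains(text(), 'Next')]` does, is stable on both pages. A stdlib sketch with simplified stand-in markup:

```python
import xml.etree.ElementTree as ET

# Simplified pagination markup: page 1 has only "Next",
# page 2 gains a "Previous" link in front of it.
PAGE_1 = "<nav><span>Next</span></nav>"
PAGE_2 = "<nav><span>Previous</span><span>Next</span></nav>"

def first_span(html):
    """Positional rule: take the first <span>, like an auto-generated XPath."""
    return ET.fromstring(html).find("span").text

def span_containing(html, needle):
    """Text-anchored rule: find the <span> whose text contains needle."""
    for span in ET.fromstring(html).iter("span"):
        if needle in (span.text or ""):
            return span.text
    return None
```

The positional rule returns "Next" on page 1 but "Previous" on page 2, while the text-anchored rule returns "Next" on both, which is exactly the failure and fix described in step 9.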
10. That's it. You are done. Click the "Extract data" button at the top to run the task.
Note that if you want to try other recruitment websites (like glassdoor.com), simply check out this post!