Author: The Octoparse Team
Scrape Article Information from Google Scholar
Wednesday, January 11, 2017 9:16 PM
Google Scholar provides a simple way to search broadly for scholarly literature. As a freely accessible web search engine, it is a good source of academic data to scrape.
In this tutorial, we are going to show you how to scrape search results from Google Scholar with Octoparse.
Before you start building a crawler on your own, you may want to check out the pre-built Google Scholar template for an easier way to get data. Enter your keywords to get the data extracted within minutes!
If the template falls short of your needs and you would like to build the crawler from scratch, continue with this tutorial. We will use this sample URL: https://scholar.google.com/ncr
We will search with multiple keywords and scrape the title, author, and description information for each article from the search results pages.
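To make the target output concrete, here is a minimal Python sketch of one scraped row and how a list of rows could be written out as CSV. It is an illustration only, not part of Octoparse: the `ArticleRecord` class, its `keyword` field, and the `to_csv` helper are hypothetical names chosen for this example, and the column names in your own task may differ.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class ArticleRecord:
    """One search result row (hypothetical structure for illustration)."""
    keyword: str      # keyword that produced this result (assumed extra column)
    title: str        # article title
    author: str       # author line shown under the title
    description: str  # snippet shown in the search result


def to_csv(records):
    """Serialize records to CSV text with a header row."""
    buffer = io.StringIO()
    column_names = [f.name for f in fields(ArticleRecord)]
    writer = csv.DictWriter(buffer, fieldnames=column_names)
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()
```

Keeping the keyword next to each row is a design choice for this sketch: it makes it easy to tell which search produced which article once results from several keywords are combined.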
Every workflow in Octoparse starts by telling Octoparse which web page to start from.
You can also enter the URL by creating the task in advanced mode.
Either way, check if a Go to Web Page action has been generated in your workflow. If you have more than one URL, check this article to see how Octoparse handles a list of URLs. Now we have reached the target web page.
If we want to search multiple keywords on Google Scholar, we need to create a loop search action for our keyword list.
We can check if the steps are set up correctly by clicking the Loop Item and then Enter Text in the workflow to see if the text would be entered into the web page.
Now Octoparse will automatically enter every keyword in the list in the search box and click the search icon. We will go to the search result page in a new tab.
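For readers who think in code, the keyword loop can be pictured roughly as follows. This is a sketch under stated assumptions, not how Octoparse works internally: it assumes that a Google Scholar search corresponds to a URL of the form `https://scholar.google.com/scholar?q=...`, and `build_search_urls` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed search endpoint; the tutorial itself only gives the start page URL.
SCHOLAR_SEARCH = "https://scholar.google.com/scholar"


def build_search_urls(keywords):
    """Return one search URL per non-empty keyword, in the order given."""
    urls = []
    for keyword in keywords:
        cleaned = keyword.strip()
        if not cleaned:
            continue  # skip blank entries in the keyword list
        urls.append(f"{SCHOLAR_SEARCH}?{urlencode({'q': cleaned})}")
    return urls


if __name__ == "__main__":
    for url in build_search_urls(["web scraping", "data mining"]):
        print(url)
```

Each keyword in the list becomes one pass through the loop, which mirrors what the Loop Item and Enter Text steps do in the workflow.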
If you are on version 8 or above, Octoparse can auto-detect all sorts of web page elements and guide you through the settings on data extraction, pagination, page scroll, and so on. Use this feature to set up another loop to extract data from each result page.
Now Octoparse will go to each result page and scrape the data we want.
Setting a wait time before the Extract Data action is mandatory, because Google Scholar applies anti-scraping measures and may ask us to pass a reCAPTCHA test if we scrape too fast.
Now Octoparse will wait 3 seconds every time it executes the Extract Data action.
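The effect of this wait can be sketched in Python as a fixed pause before every extraction. This is only an illustration of the throttling idea, not Octoparse's implementation: `extract_with_wait` and its parameters are hypothetical names, and `extract` stands for whatever function pulls data out of a page.

```python
import time

WAIT_SECONDS = 3  # same 3-second wait as in the workflow setting


def extract_with_wait(pages, extract, wait_seconds=WAIT_SECONDS, sleep=time.sleep):
    """Call extract(page) for each page, pausing before every call."""
    results = []
    for page in pages:
        sleep(wait_seconds)  # pause first, then extract, to keep the request rate low
        results.append(extract(page))
    return results
```

The `sleep` parameter is injectable so the pacing logic can be checked without actually waiting. In real use you would leave it at its default.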
The last step is to save your task and run it.
Here is the sample output from a local run.
Tip! Local runs are great for task troubleshooting and quick runs. If you are dealing with more complicated tasks, it is recommended that you select Run in the Cloud to run the task in Octoparse's cloud-based platform for higher speed. Try out this premium feature by signing up for the 14-day free trial here. You can also schedule your task to run hourly, daily, or weekly and get data delivered to you regularly.
If you have further issues with the task or have a suggestion that would make this a better resource for you, we’d love to hear about it. Submit a request here.