Scrape Article Information from Google Scholar
Wednesday, January 11, 2017 9:16 PM
In this web scraping tutorial, we will scrape article information from the Google Scholar search results for "cancer". We will scrape the latest articles from this website with Octoparse to get details of each abstract, such as the article title, the publication date, and the authors. There are two parts to getting real-time dynamic data in Octoparse: making a scraping task, and scheduling that task on Octoparse's cloud platform.
The website URL we will use is https://scholar.google.com/scholar?hl=en&q=cancer&as_sdt=1%2C5&as_sdtp=&oq=.
The data fields include the article title, publication date, citation count, related-version URLs, article URL, authors, and source.
You can directly download the task (the .otd file) to begin collecting the data, or you can follow the steps below to make a scraping task that scrapes the latest articles from Google Scholar. (Download the extraction task for this tutorial HERE in case you need it.)
Part 1. Make a scraping task in Octoparse
Step 1. Set up basic information.
Click "Quick Start" ➜ Choose "New Task (Advanced Mode)" ➜ Complete basic information ➜ Click "Next".
Step 2. Enter the target URL in the built-in browser ➜ Click the "Go" icon to open the webpage.
(URL of the example: https://scholar.google.com/scholar?hl=en&q=cancer&as_sdt=1%2C5&as_sdtp=&oq=)
Step 3. Right click the first abstract area. ➜ Create a list of target areas with similar layout. Click "Create a list of items" (articles with similar layout). ➜ "Add current item to the list".
Note: If Octoparse doesn't automatically select the target area, you can click the "Expansion Area" button in the upper right corner to adjust it.
The first article is now added to the list ➜ Click "Continue to edit the list".
Right-click the second abstract area ➜ Click "Add current item to the list" again (now we get all the abstracts with a similar layout) ➜ Click "Finish Creating List" ➜ Click "Loop" to process the list and extract the abstract data.
Note that when we add the second article to the list, Octoparse automatically adds all of the remaining articles on the page to the "Loop Item" box, as we can see in the item list.
Step 4. Extract the content of the article.
Right-click the title of the article ➜ Select "Extract text". Other contents can be extracted in the same way.
All the content will be selected in Data Fields. ➜ Click the "Field Name" to modify. Then click "Save".
Note: Right-click the content, if necessary, to avoid triggering its hyperlink.
Step 5. In the Define Fields table, click the field we just extracted and click the "Customize Field" button. Then select the second option, "Define ways to locate an item".
In the "Matching XPath" bar, paste the XPath expression copied from FirePath to relocate the relative path. Then click "OK".
We can then see that the hidden content is extracted.
Take data field 'Title' for example.
Click “Customize Field” ➜ Click “Define ways to locate an item” ➜ Paste the XPath expression copied from Firepath to replace the auto-generated XPath parameter in the XPath input box.
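To see what such an XPath expression actually matches, here is a minimal Python sketch using the standard library's ElementTree. The HTML fragment and the gs_rt / gs_a class names are assumptions made for illustration; Google Scholar's real markup may differ, so copy the actual expression from FirePath for your own task.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up fragment of a search-result page; real Scholar
# markup may use different class names or structure.
fragment = """
<div class="gs_r">
  <h3 class="gs_rt"><a href="/article1">Cancer statistics, 2016</a></h3>
  <div class="gs_a">RL Siegel, KD Miller - CA: A Cancer Journal, 2016</div>
</div>
"""
root = ET.fromstring(fragment)

# Relative XPath expressions, like those pasted into 'Matching XPath'
title = root.find(".//h3[@class='gs_rt']/a").text
source = root.find(".//div[@class='gs_a']").text

print(title)   # the article title
print(source)  # the author and source line
```

Pasting an expression like `//h3[@class='gs_rt']/a` into the "Matching XPath" box tells Octoparse to locate the title link the same way.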
For some data fields, the extracted data is not well formatted. In that case, we can also click the "Customize Field" button, select the third option, "Re-format extracted data", and apply a regular expression.
Take data field 'Year' for example.
Click “Customize Field” ➜ Click “Re-format extracted data” ➜ Click “Add step” ➜ Click “Replace with Regular Expression”➜ Click “Add step” ➜ Click “Match with Regular Expression”.
If you don’t know how to write a regular expression, you could try “Try RegEx Tool”.
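To make the two regular-expression steps concrete, here is a small Python sketch that matches a four-digit year and strips a trailing publisher segment. The sample source line is made up for illustration; the real text comes from the extracted field and its format may vary.

```python
import re

# Made-up example of an extracted author/source line
source_line = "RL Siegel, KD Miller, A Jemal - CA: A Cancer Journal, 2016 - Wiley"

# "Match with Regular Expression": pull out a standalone four-digit year
match = re.search(r"\b(19|20)\d{2}\b", source_line)
year = match.group(0) if match else ""

# "Replace with Regular Expression": drop the trailing " - Wiley" part
cleaned = re.sub(r"\s*-\s*[^-]*$", "", source_line)

print(year)     # 2016
print(cleaned)  # RL Siegel, KD Miller, A Jemal - CA: A Cancer Journal, 2016
```

The patterns you enter in Octoparse's RegEx steps work the same way: the match step keeps only the matching text, and the replace step removes or rewrites it.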
Step 6. Right-click the "Next" button at the bottom of the web page ➜ Click the Advanced Options and select the option 'Loop click Next page'.
Note that if 'Cycle Pages' is nested inside 'Loop Item', we need to adjust their nesting order so that 'Cycle Pages' is the outer loop.
Step 7. Check the workflow.
Now we need to check the workflow by clicking through the actions from the beginning of the workflow, making sure that we can scrape the content from the pages.
Go to Web Page ➜ The Loop Item box ➜ Click Item ➜ Extract Data ➜ Click to Paginate.
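Conceptually, this workflow is two nested loops with pagination on the outside. The rough Python sketch below illustrates the ordering; `fetch_page` and `extract_fields` are placeholders standing in for what Octoparse does internally, not real Octoparse APIs.

```python
# Placeholder: would load one page of search results
def fetch_page(page_number):
    return [{"title": f"Article {page_number}-{i}"} for i in range(3)]

# Placeholder: the "Extract Data" step pulling the configured fields
def extract_fields(item):
    return {"title": item["title"]}

records = []
for page in range(1, 4):           # outer loop: "Click to Paginate"
    for item in fetch_page(page):  # inner loop: "Loop Item" over articles
        records.append(extract_fields(item))  # "Extract Data"

print(len(records))  # 3 pages x 3 items = 9 records
```

This is why Step 6 requires 'Cycle Pages' to be the outer loop: all items on a page must be processed before clicking "Next".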
Step 8. Click "Save" to save your configuration. Then click "Next" ➜ Click "Next" ➜ Click "Local Extraction" to run the task on your computer. Octoparse will automatically extract all the selected data.
Step 9. The extracted data will be shown in the "Data Extracted" pane. Click the "Export" button to export the results to an Excel file, a database, or other formats, and save the file to your computer.
Part 2. Schedule a task and run it on Octoparse's cloud platform.
After you have built the scraping task by following the steps above in this web scraping tutorial, you can schedule the task to run on Octoparse's cloud platform.
Step 1. Find the task you've just made ➜ Double-click the task to open it ➜ Keep clicking "Next" until you reach the "Done" step ➜ Select the option "Schedule Cloud Extraction Settings" to begin the scheduling process.
Step 2. Set the parameters.
In the "Schedule Cloud Extraction Settings" dialog box, you can select the Periods of Availability for your task's extraction and the Run Mode, which runs your periodic tasks to collect data at set intervals.
· Periods of Availability - Set the data extraction period with a Start date and End date.
· Run Mode - Once, Weekly, Monthly, Real Time
Set a suitable time interval for collecting the data and click "Start" to schedule your task. After you click "OK" in the Cloud Extraction Scheduled window, the task will be added to the waiting queue, and you can check its status.
Author: The Octoparse Team
For more information about Octoparse, please click here.
Sign up today!