Scraping Articles from Reuters.com
Thursday, January 5, 2017 9:22 PM
In this tutorial, I will show you how to scrape a batch of news articles from Reuters.com with Octoparse.
Our data fields include the article title, body text, published date/time, and author name.
Use the sample URL below to follow along:
https://www.reuters.com/news/archive/marketsNews
Step 1. Create a Go to Web Page action - to go to the target webpage
- Enter the sample URL in the search bar on the home screen and click Start
Step 2. Auto-detect the webpage - to create the workflow
- Click auto-detect webpage from the Tips panel and wait for it to complete
- Choose the desired auto-detect result (1/3)
- Check the Paginate to scrape more pages option to see if it works for our webpage
- Uncheck Add a page scroll
- Click Create Workflow
- Click Click on links to scrape the linked page(s)
- Select the right data field for the linked page URL from the dropdown menu and click Check to see if it works
- Click Confirm to save the settings
- Select the first paragraph of the article and choose Select All from the Tips panel
- Click Extract text of the selected element
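The workflow built in Step 2 can be read as a nested loop: page through the archive, follow each article link, and extract fields from the linked page. The sketch below only illustrates that logic on a made-up in-memory site. The names LISTING_PAGES, ARTICLES, and run_workflow are hypothetical and are not part of Octoparse or Reuters.com.

```python
# Hypothetical in-memory stand-in for the archive: each listing page
# holds article links and a pointer to the next page (None on the last).
LISTING_PAGES = {
    "page-1": {"links": ["article-a", "article-b"], "next": "page-2"},
    "page-2": {"links": ["article-c"], "next": None},
}

# Hypothetical article pages keyed by link.
ARTICLES = {
    "article-a": {"title": "Title A", "paragraphs": ["a1", "a2"]},
    "article-b": {"title": "Title B", "paragraphs": ["b1"]},
    "article-c": {"title": "Title C", "paragraphs": ["c1", "c2", "c3"]},
}


def run_workflow(start_page):
    """Walk every listing page, open each linked article, extract fields."""
    rows = []
    page = start_page
    while page is not None:                  # pagination: keep going until no next page
        listing = LISTING_PAGES[page]
        for link in listing["links"]:        # loop over the links on this listing page
            article = ARTICLES[link]         # "click" the link to open the linked page
            rows.append({
                "title": article["title"],
                "paragraphs": list(article["paragraphs"]),
            })
        page = listing["next"]
    return rows


rows = run_workflow("page-1")
for row in rows:
    print(row["title"], len(row["paragraphs"]))
```

Note that each article still yields several paragraph values at this point, which is why the paragraph field needs further adjustment in the next step.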
Step 3. Adjust workflow settings
- Rename the data fields for the first Extract Data action
- Click the three dots for more settings on the paragraph data field
- Click Customize XPath and change the XPath for the paragraph data field to //p[contains(@data-testid,"paragraph")]
- Click Merge multiple rows of data into one
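To see what the XPath and the merge option do together, here is a minimal Python sketch using only the standard library. It mimics //p[contains(@data-testid,"paragraph")] by keeping every p element whose data-testid attribute contains "paragraph", then joins the matches into a single value. The sample HTML and the data-testid values in it are assumptions for illustration; the real Reuters markup may differ, and this is not how Octoparse is implemented internally.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of <p> elements whose data-testid contains 'paragraph'."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []    # one entry per matching <p>
        self._inside = False    # True while parsing a matching <p>
        self._buffer = []       # text pieces of the current <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p" and not self._inside:
            testid = dict(attrs).get("data-testid") or ""
            if "paragraph" in testid:   # same test as contains(@data-testid, "paragraph")
                self._inside = True
                self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._inside:
            self._inside = False
            self.paragraphs.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)


# Hypothetical article markup; attribute values are made up for this example.
SAMPLE_HTML = """
<article>
  <h1>Sample headline</h1>
  <p data-testid="paragraph-0">Stocks rose on Monday.</p>
  <p class="caption">Photo caption, not body text.</p>
  <p data-testid="paragraph-1">Bond yields were <a href="#">little changed</a>.</p>
</article>
"""


def merge_paragraphs(html):
    """Return all matching paragraphs merged into one string."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.paragraphs)


merged = merge_paragraphs(SAMPLE_HTML)
print(merged)
```

The caption paragraph is skipped because it has no data-testid attribute, and the two body paragraphs end up in one field, which is the effect of merging multiple rows of data into one.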
Step 4. Save the task and run it to get data
- Click Save on the upper right to save your task
- Click Run next to it and wait for a Run Task window to pop up
- Select Run on your device to run the task on your local device
Happy Data Hunting!
Author: The Octoparse Team