
Scraping Articles from Reuters.com

Thursday, January 5, 2017 9:22 PM


 

In this tutorial, I will show you how to quickly scrape a bunch of news articles from Reuters.com.

Our data fields include the article title, body text, published date/time, and author name.

Use the sample URL below to follow along:

https://www.reuters.com/news/archive/marketsNews

 

Step 1. Create a Go to Web Page action - to open the target webpage

  • Enter the sample URL in the search bar on the home screen and click Start

 

Step 2. Auto-detect the webpage - to create the workflow

  • Click auto-detect webpage from the Tips panel and wait for it to complete
  • Choose the desired auto-detect result (1/3)
  • Check the Paginate to scrape more pages option to see if it works for our webpage
  • Uncheck Add a page scroll
  • Click Create Workflow

 


 

  • Click the "Click on links to scrape the linked page(s)" option

 


  • Select the right data field for the linked page URL from the dropdown menu and click Check to see if it works
  • Click Confirm to save the settings

 


 

  • Select the first paragraph of the article and choose Select All from the Tips panel
  • Click Extract text of the selected element

 

Step 3. Adjust workflow settings

  • Rename the data fields for the first Extract Data action
  • Click the three dots for more settings on the paragraph data field
  • Click Customize XPath and change the XPath for the paragraph data field to //p[contains(@data-testid,"paragraph")]
  • Click Merge multiple rows of data into one
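To see what the XPath and the merge option do together, here is a minimal sketch outside Octoparse, using only Python's standard library. It keeps every p element whose data-testid attribute contains "paragraph" and joins the matches into one value. The sample HTML, the class name ParagraphCollector, and the function merge_paragraphs are assumptions made for illustration. They are not taken from Reuters.com or from Octoparse, and joining with a newline is also an assumption about how the merged rows are separated.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect text of <p> elements whose data-testid contains "paragraph".

    Mirrors the intent of //p[contains(@data-testid,"paragraph")].
    """

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # one entry per matching <p>
        self._buffer = None    # text pieces of the <p> being read, if any

    def handle_starttag(self, tag, attrs):
        testid = dict(attrs).get("data-testid") or ""
        if tag == "p" and "paragraph" in testid:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:
                self.paragraphs.append(text)
            self._buffer = None


def merge_paragraphs(html):
    """Return all matching paragraphs merged into one string."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.paragraphs)


# Hypothetical article markup, for illustration only.
SAMPLE_HTML = """
<article>
  <p data-testid="paragraph-0">Stocks rose on Thursday.</p>
  <p class="caption">Photo credit line</p>
  <p data-testid="paragraph-1">Bond yields <b>fell</b> slightly.</p>
</article>
"""

print(merge_paragraphs(SAMPLE_HTML))
```

The caption paragraph is skipped because it has no data-testid attribute, which is why the customized XPath gives cleaner body text than selecting every p element on the page.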

 

Step 4. Save the task and run it to get data

  • Click Save on the upper right to save your task
  • Click Run next to it and wait for a Run Task window to pop up
  • Select Run on your device to run the task on your local device

 

Happy Data Hunting!

Author: The Octoparse Team
