
Scraping Articles from Yahoo! Tech

Tuesday, January 3, 2017 4:51 AM

 

Octoparse enables you to scrape the latest news from yahoo.com/tech. There are two parts to getting real-time data with Octoparse: making a scraping task, and scheduling the task to run in the Octoparse cloud.

 

In this web scraping tutorial, we will use Octoparse to scrape Yahoo!'s technology news site for article information such as the title, body, published date, and author.

The website URL we will use is https://www.yahoo.com/tech.

The data fields include article title, article body, published date and author.

 

You can directly download the task (the .otd file) to begin collecting the data, or follow the steps below to build a scraping task for Yahoo! Tech articles yourself. (Download my extraction task for this tutorial HERE in case you need it.)

 

Part 1. Make a scraping task in Octoparse

 

Step 1. Set up basic information.

 

Click "Quick Start" ➜ Choose "New Task (Advanced Mode)" ➜ Complete basic information ➜ Click "Next".

 

Step 2. Enter the target URL in the built-in browser ➜ Click the "Go" icon to open the webpage.

(URL of the example: https://www.yahoo.com/tech)

 

 

When we scroll to the bottom of this web page, new content loads automatically, so we need to check the "Scroll Down" option under "Advanced Options" ➜ Set it to execute 10 times with a 1-second time interval ➜ Select "Scroll to the bottom of the page".

Select "Clear cache before opening the web page" under "Cache Settings" so the task runs reliably when scheduled on the cloud platform.

 

Step 3. Move your cursor over the section with a similar layout, where you will extract data about these articles.

 

Right click the first article ➜ Click the "Expand the selected area" button to select the A tag ➜ Click "Create a list of items" (sections with similar layout) ➜ Click "Add current item to the list".

The first article has now been added to the list ➜ Click "Continue to edit the list".

Right click the second article ➜ Click the "Expand the selected area" button to select the A tag ➜ Click "Add current item to the list" again ➜ Click "Finish Creating List" ➜ Click "Loop" to process the list and extract elements from these articles.

 

 

We noticed that the auto-generated XPath for the loop could not correctly match all the articles with a similar layout, so we need to modify it.

 

 

Click the Loop Item box ➜ Replace the original XPath with the correct one ➜ Click "Save" ➜ Go to the "Click Item" action ➜ Uncheck the "Open the link in new tab" option ➜ Tick the "AJAX Load" checkbox ➜ Set an AJAX timeout of 15 seconds ➜ Click "Save".

The correct XPath is //*[contains(@class,'Fw($fweight) Lh($lheight) Lts($lspacing-sm) Fsm($fsmoothing) Fsmw($fsmoothing) Fsmm($fsmoothing)')]
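This class-based XPath can be fragile, since Yahoo!'s generated class names change over time, but the `contains(@class, …)` predicate itself is simple: it keeps every element whose class attribute contains the given substring. Below is a minimal pure-Python sketch of that predicate, run against a tiny hypothetical HTML fragment (a stand-in for the live Yahoo! Tech markup, not the real page):

```python
from html.parser import HTMLParser

# The substring the XPath predicate looks for inside @class.
NEEDLE = ("Fw($fweight) Lh($lheight) Lts($lspacing-sm) "
          "Fsm($fsmoothing) Fsmw($fsmoothing) Fsmm($fsmoothing)")

class ClassContains(HTMLParser):
    """Collect the text of <a> tags whose class attribute contains NEEDLE,
    mimicking //*[contains(@class, NEEDLE)] restricted to A tags."""
    def __init__(self):
        super().__init__()
        self.matching = []
        self._grab = False

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class") or ""
        self._grab = (tag == "a" and NEEDLE in cls)

    def handle_data(self, data):
        if self._grab:
            self.matching.append(data)
            self._grab = False

# Hypothetical fragment: two article links with the target classes, one without.
fragment = (
    f'<a class="{NEEDLE}">Article one</a>'
    f'<a class="{NEEDLE} extra">Article two</a>'
    '<a class="other">Not an article</a>'
)

parser = ClassContains()
parser.feed(fragment)
print(parser.matching)  # ['Article one', 'Article two']
```

On the live page the matching is done by Octoparse's own XPath engine; this sketch only illustrates the substring semantics of `contains()`.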

 

Step 4. Extract information from these articles.

 

Right click the article title ➜ Select "Extract text". The other fields can be extracted in the same way.

All the selected content will appear under Data Fields ➜ Click a "Field Name" to rename it, then click "Save".

 

Note: Right click the content, if necessary, to avoid triggering its hyperlink.

 

Step 5. Re-format the data field.

 

For data field “Author”, we will modify its XPath to select the element correctly.

Choose the data field ➜ Select the “Customize Field” button ➜ Choose “Define ways to locate an item” ➜ Enter the correct XPath ➜ Click "OK" ➜ Click "OK".

The XPath for the "Author" is 

.//*[@id='SideTop-3-HeadComponentAttribution']/.//div[contains(@class,'auth-prov-so')]/div[contains(@class,'author')]

 

For data field “BodyText”, we will modify its XPath and use regular expression to select the element correctly. 

 

Step 5-1. Modify its XPath

 

Choose the data field "BodyText" ➜ Select the “Customize Field” button ➜ Choose “Define ways to locate an item” ➜ Enter the correct XPath ➜ Click "OK" ➜ Click "OK".

The XPath for "BodyText" is .//*[@id='Col1-0-ContentCanvas']/article

 

Step 5-2. Use regular expressions to re-format the data field.

 

We will use regular expressions to remove the image tags.

 

Click “Add step” ➜ Select “Replace with Regular Expression”.

Use "Try RegEx Tool" ➜ Check the options "Start With" and "Include Start" with the value <img ➜ Check the options "End With" and "Include End" with the value /> ➜ Click “Generate” to create the regular expressions ➜ Check the option "Match All" ➜ Click “Match” ➜ Click “Apply”.

Click "Calculate" ➜ Click “OK” ➜ Click “Done”. Then you will see the body text of the article has been extracted correctly. ➜ Click “Save”.
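The "Replace with Regular Expression" step above is roughly equivalent to the following Python substitution (a sketch: Octoparse generates its own pattern from the "Start With" and "End With" values, which may differ in detail):

```python
import re

# Non-greedy match from "<img" to the next "/>", removed everywhere
# ("Match All"). DOTALL lets a tag span line breaks.
IMG_TAG = re.compile(r"<img.*?/>", re.DOTALL)

# Hypothetical body text with an embedded image tag.
body = ('<p>Yahoo unveils a new gadget.</p>'
        '<img src="gadget.jpg" alt="gadget" />'
        '<p>The device ships next month.</p>')

cleaned = IMG_TAG.sub("", body)
print(cleaned)  # the two paragraphs, with the image tag removed
```

The non-greedy `.*?` matters: a greedy `.*` would delete everything between the first `<img` and the last `/>` in the text.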

 

Step 6. Copy and paste the "Go To Web Page" action

 

We need to add a "Go To Web Page" action that returns to the original web page so we can extract the content of the other news articles.

Right click the "Go To Web Page" action and choose "Copy" ➜ Go to the "Extract Data" action ➜ Right click the "Extract Data" action and choose "Paste". A "Go To Web Page" action will be generated ➜ Click "Save".

 

Step 7. Check the workflow

 

Now we need to check the workflow by clicking actions from the beginning of the workflow.

Go to Web Page ➜ Loop Item ➜ Click Item ➜ Extract Data ➜ Go to Web Page

 

 

Step 8. Click "Save" to save your configuration. Then click "Next" ➜ Click "Next" ➜ Click "Local Extraction" to run the task on your computer. Octoparse will automatically extract all the data selected.

 

 

Step 9. The extracted data will be shown in the "Data Extracted" pane. Click the "Export" button to export the results to an Excel file, a database, or other formats and save the file to your computer.

 

 

Part 2. Schedule a task and run it on Octoparse's cloud platform.

 

After you have built the scraping task by following the steps above, you can schedule it to run in the Octoparse cloud.

 

Step 1. Find the task you've just made ➜ Double click the task to open it ➜ Keep clicking "Next" until you reach the "Done" step ➜ Select the "Schedule Cloud Extraction Settings" option to begin the scheduling process.

 

 

Step 2. Set the parameters. 

 

In the “Schedule Cloud Extraction Settings” dialog box, you can set the Periods of Availability for your task's extraction and the Run Mode, which determines how often your periodic task runs to collect data.

 

 · Periods of Availability - The data extraction period, defined by a Start date and an End date.

 · Run Mode - Once, Weekly, Monthly, Real Time 

 

We can set a suitable time interval for collecting the data and click "Start" to schedule the task. After you click "OK" in the Cloud Extraction Scheduled window, the task will be added to the waiting queue, and you can check its status.
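As a rough illustration (this is not Octoparse code), a Weekly run mode within a period of availability simply means one run per week between the start and end dates:

```python
# Hypothetical sketch of "Periods of Availability" + Weekly run mode:
# runs fire once a week from the start date until the end date.
from datetime import date, timedelta

def weekly_runs(start, end):
    """Yield the dates a weekly task would run between start and end."""
    current = start
    while current <= end:
        yield current
        current += timedelta(weeks=1)

runs = list(weekly_runs(date(2017, 1, 3), date(2017, 1, 31)))
print(runs)  # five Tuesdays: Jan 3, 10, 17, 24, 31
```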

 

 

Author: The Octoparse Team

 

 
