Scrape Articles from CNN Money
Tuesday, January 10, 2017 5:05 AM
Octoparse enables you to scrape news articles from CNN Money. Getting real-time data in Octoparse involves two parts: making a scraping task, and scheduling the task to run on Octoparse's cloud platform.
In this web scraping tutorial, we will use Octoparse to scrape international news articles from money.cnn.com and get the content of the latest articles - the article title, body text, published date, and author.
The website URL we will use is http://money.cnn.com/INTERNATIONAL/.
The data fields include the article title, body text, published date, author, and subhead.
You can directly download the task (the .otd file) to begin collecting the data, or you can follow the steps below to make a scraping task that scrapes the latest news articles from CNNMoney. (Download the extraction task for this tutorial HERE in case you need it.)
Part 1. Make a scraping task in Octoparse
Step 1. Set up basic information.
Click "Quick Start" ➜ Choose "New Task (Advanced Mode)" ➜Complete basic information ➜ Click "Next".
Step 2. Enter the target URL in the built-in browser. ➜ Click "Go" icon to open the webpage.
(URL of the example: http://money.cnn.com/INTERNATIONAL/)
Step 3. Move your cursor over the article with similar layout, where you would extract the content of the article.
Right click the first article ➜ Create a list of sections with similar layout. Click "Create a list of items" (sections with similar layout). ➜ "Add current item to the list".
Then the first article has been added to the list. ➜ Click "Continue to edit the list".
Right click the second article ➜ Click "Add current item to the list" again (Now we get all the articles with similar layout) ➜ Click "Finish Creating List" ➜ Click "loop" to process the list for extracting the content of the articles.
Sometimes the original XPath fails to select the elements after the web page updates its articles. In that case, modify the XPath of the loop to better select the web elements.
Click the "Loop Item" box ➜ Enter the new XPath in the "Variable list" textbox ➜ Click "Save".
The new XPath is //DIV[contains(@class,'intl-hp-stack-hed-container')].
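As a rough illustration of what this XPath selects, the following sketch mirrors the same condition - a div whose class contains 'intl-hp-stack-hed-container' - using only Python's standard library on a hypothetical HTML fragment (the real CNNMoney markup may differ):

```python
from html.parser import HTMLParser

class ArticleContainerCounter(HTMLParser):
    """Count <div> elements whose class attribute contains the target token,
    the same condition as //DIV[contains(@class,'intl-hp-stack-hed-container')]."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; check the class attribute.
        if tag == "div" and "intl-hp-stack-hed-container" in (dict(attrs).get("class") or ""):
            self.count += 1

# Hypothetical fragment: two article containers plus one unrelated module.
fragment = (
    '<div class="intl-hp-stack-hed-container"><a href="/a">Article one</a></div>'
    '<div class="intl-hp-stack-hed-container"><a href="/b">Article two</a></div>'
    '<div class="other-module">Not an article</div>'
)
parser = ArticleContainerCounter()
parser.feed(fragment)
print(parser.count)  # 2 article containers matched
```

Octoparse loops over every element the expression matches, so each article container becomes one iteration of the Loop Item.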
Step 4. Extract the content of the article.
Click the title of the article ➜ Select "Extract text". Other content such as the author and subhead can be extracted in the same way.
For the body text of the article, however, select "Extract Outer HTML"; we will use regular expressions to extract the text in Step 5.
After all the content has been selected in "Data Fields", click each "Field Name" to rename it. Then click "Save".
Note: You can right-click the content to avoid triggering its hyperlink if necessary.
Step 5. Re-format the data field.
For the data field "BodyText", we need to clean up the extracted HTML so that only the body text of the article remains.
Choose the data field ➜ Select the “Customize Field” button ➜ Choose “Re-format extracted data”.
Click "Add step" ➜ Select "Replace with Regular Expression" ➜ Enter '\n' in the Regular Expression box to remove the newline characters ➜ Click "Calculate" ➜ Click "OK".
Click "Add step" ➜ Select "Match with Regular Expression" ➜ Use "Try RegEx Tool" ➜ Check the options "Start With" and "Include Start" with the value <p ➜ Check the options "End With" and "Include End" with the value /p> ➜ Click "Generate" to create the regular expression ➜ Check the option "Match All" ➜ Click "Match" ➜ Click "Apply" ➜ Click "Calculate" ➜ Click "OK". This keeps only the paragraph elements of the article.
Click "Add step" ➜ Select "Replace with Regular Expression" ➜ Use "Try RegEx Tool" ➜ Check the options "Start With" and "Include Start" with the value < ➜ Check the options "End With" and "Include End" with the value > ➜ Click "Generate" to create the regular expression ➜ Check the option "Match All" ➜ Click "Match" ➜ Click "Apply" ➜ Click "Calculate" ➜ Click "OK". This strips the remaining HTML tags.
Click "Done". You will see that the body text of the article has been extracted correctly. ➜ Click "Save".
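The three re-format steps above can be sketched with Python's re module, using a hypothetical "Outer HTML" value standing in for what Octoparse captures:

```python
import re

# Hypothetical outer HTML of an article body (the real markup may differ).
outer_html = (
    '<div class="storytext">\n'
    '<p>London (CNNMoney) First paragraph.</p>\n'
    '<p>Second <b>paragraph</b>.</p>\n'
    '</div>'
)

# Step 1 - "Replace with Regular Expression": remove the newline characters.
text = re.sub(r"\n", "", outer_html)

# Step 2 - "Match with Regular Expression": keep everything from "<p" to
# "/p>", matching all occurrences ("Match All").
paragraphs = re.findall(r"<p.*?/p>", text)
text = "".join(paragraphs)

# Step 3 - "Replace with Regular Expression": strip any remaining tags,
# i.e. everything from "<" to ">".
text = re.sub(r"<.*?>", "", text)

print(text)  # London (CNNMoney) First paragraph.Second paragraph.
```

The non-greedy `.*?` corresponds to the "Start With"/"End With" pairs generated by the RegEx Tool, and `findall` plays the role of the "Match All" option.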
Step 6. Check the workflow.
Now we need to check the workflow by clicking the actions one by one, from the beginning of the workflow:
Go to Web Page ➜ Loop Item ➜ Click Item ➜ Extract Data.
Step 7. Click "Save" to save your configuration. Then click "Next" ➜ Click "Next" ➜ Click "Local Extraction" to run the task on your computer. Octoparse will automatically extract all the data selected.
Step 8. The extracted data will be shown in the "Data Extracted" pane. Click the "Export" button to export the results to an Excel file, a database, or another format, and save the file to your computer.
Part 2. Schedule a task and run it on Octoparse's cloud platform.
After you have built the scraping task by following the steps above, you can schedule it to run on Octoparse's cloud platform.
Step 1. Find the task you've just made ➜ Double-click the task to open it ➜ Keep clicking "Next" until you reach the "Done" step ➜ Select the option "Schedule Cloud Extraction Settings" to begin scheduling.
Step 2. Set the parameters.
In the "Schedule Cloud Extraction Settings" dialog box, you can set the Periods of Availability for your task's extraction and the Run Mode, which determines how often the periodic task runs to collect data.
· Periods of Availability - the data extraction period, defined by the Start date and End date.
· Run Mode - Once, Weekly, Monthly, or Real Time.
Set a suitable time interval for collecting the articles, then click "Start" to schedule your task. After you click "OK" in the "Cloud Extraction Scheduled" window, the task will be added to the waiting queue, and you can check its status.
Author: The Octoparse Team
For more information about Octoparse, please click here.
Sign up today!