Scrape AJAX Pages from USA TODAY
Tuesday, January 10, 2017 3:41 AM
Octoparse can scrape AJAX websites, that is, websites that load content dynamically without reloading the whole page.
In this web scraping tutorial, we will show you how to scrape AJAX-driven websites such as www.usatoday.com. We will use Octoparse to scrape technology news articles from this website and collect the content of the latest articles, such as the title, the body text, the published date and the author. Getting real-time data in Octoparse takes two parts: making a scraping task, and scheduling the task on Octoparse's cloud platform.
The website URL we will use is http://www.usatoday.com/tech/news/.
The data fields include the article title, the body text, the published date, the author, the number of Facebook Connect and the number of comments.
You can download the task (the OTD file) directly to begin collecting the data, or you can follow the steps below to make a scraping task that scrapes the latest tech news articles from usatoday.com.
Part 1. Make a scraping task in Octoparse
Step 1. Set up basic information.
Click "Quick Start" ➜ Choose "New Task (Advanced Mode)" ➜ Complete basic information ➜ Click "Next".
Step 2. Enter the target URL in the built-in browser ➜ Click the "Go" icon to open the webpage.
(URL of the example: http://www.usatoday.com/tech/news/)
Step 3. Click the "Show More News" button to extract more information.
Click on the “Show More News” button ➜ Choose "Loop click in the element" to create a loop automatically ➜ Click "Save".
Because the web page uses AJAX to load more news articles, we need to set an AJAX timeout for the click action so that the newly loaded content can be scraped.
Navigate to the "Click to paginate" action ➜ Tick the "AJAX Load" checkbox under "Advanced Options" ➜ Set an AJAX timeout of 3 seconds ➜ Click "Save". Then select the Loop Item box.
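The idea behind the AJAX timeout can be illustrated outside Octoparse: after each click, wait up to a fixed number of seconds for new content to appear, and stop looping once a click adds nothing. The Python sketch below is only an illustration under stated assumptions. FakePage, wait_for_growth and load_all are hypothetical names, and the page is simulated in memory rather than loaded from usatoday.com. It does not show how Octoparse works internally.

```python
import time


class FakePage:
    """In-memory stand-in for a page with a "Show More News" button (hypothetical)."""

    def __init__(self, batches):
        self._pending = list(batches)  # batches of articles not yet loaded
        self.articles = []             # articles currently visible

    def click_show_more(self):
        # A real AJAX call finishes some time after the click; here the
        # next batch is appended immediately to keep the sketch simple.
        if self._pending:
            self.articles.extend(self._pending.pop(0))


def wait_for_growth(page, old_count, timeout=3.0, poll=0.05):
    """Poll until the article count exceeds old_count or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if len(page.articles) > old_count:
            return True
        time.sleep(poll)
    return False


def load_all(page, timeout=3.0, max_clicks=50):
    """Click "Show More News" repeatedly until a click adds nothing new."""
    clicks = 0
    while clicks < max_clicks:
        before = len(page.articles)
        page.click_show_more()
        clicks += 1
        if not wait_for_growth(page, before, timeout=timeout):
            break  # timeout reached with no new articles: stop the loop
    return page.articles
```

A timeout that is too short risks moving on before the new articles have arrived, while one that is too long slows down every iteration of the loop. The 3 seconds used in this tutorial is a starting point that may need adjusting for slower connections.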
Step 4. Select the articles with a similar layout from which you want to extract content.
Right-click these articles instead of left-clicking them, so that their links are not triggered.
Right-click the first article ➜ Click "Create a list of items" (sections with similar layout) ➜ Click "Add current item to the list".
The first article is now added to the list ➜ Click "Continue to edit the list".
Right click the second article ➜ Click "Add current item to the list" again (Now we get all the articles with similar layout) ➜ Click "Finish Creating List" ➜ Click "loop" to process the list for extracting the content of the articles.
Because the web page uses AJAX to open these articles, we also need to set an AJAX timeout for the "Click Item" action.
Navigate to "Click Item" action ➜ Uncheck the "Open the link in new tab" option ➜ Tick "AJAX Load" checkbox under "Advanced Options" ➜ set an AJAX timeout of 3 seconds ➜ Click "Save".
Step 5. Extract the content of the article.
Right-click the title of the article ➜ Select "Extract text". Other content can be extracted in the same way.
All the extracted content appears under Data Fields ➜ Click a "Field Name" to rename it ➜ Click "Save".
Note: If the content contains a hyperlink, right-click it so that the link is not triggered.
Step 6. Re-format the data fields.
For the data fields "Title" and "BodyText", we need to modify their XPaths so that they select the correct elements.
Choose the data field ➜ Select the “Customize Field” button ➜ Choose “Define ways to locate an item” ➜ Enter the correct XPath ➜ Click "OK" ➜ Click "OK".
The XPath for the "Title" is .//h1[contains(@class,'headline') and contains(@itemprop,'headline')]
The XPath for the "BodyText" is .//*[@itemprop='articleBody']
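To see what these two XPaths select, the Python sketch below applies the same conditions to a small, invented piece of markup. It is only an approximation: Python's standard-library ElementTree supports a subset of XPath without contains(), so the "Title" condition is reproduced with ordinary string checks, while the "BodyText" expression is used as written. The sample markup, extract_title and extract_body are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup; the real page is far larger.
SAMPLE = """
<div>
  <h1 class="asset-headline" itemprop="headline">Sample headline</h1>
  <div itemprop="articleBody"><p>First paragraph.</p><p>Second paragraph.</p></div>
</div>
"""


def extract_title(root):
    # Approximates .//h1[contains(@class,'headline') and contains(@itemprop,'headline')]
    for h1 in root.iter("h1"):
        if "headline" in h1.get("class", "") and "headline" in h1.get("itemprop", ""):
            return "".join(h1.itertext()).strip()
    return None


def extract_body(root):
    # .//*[@itemprop='articleBody'] is within ElementTree's XPath subset.
    body = root.find(".//*[@itemprop='articleBody']")
    if body is None:
        return None
    # Join the text of all nested elements, skipping whitespace-only pieces.
    return " ".join(piece.strip() for piece in body.itertext() if piece.strip())


root = ET.fromstring(SAMPLE)
print(extract_title(root))
print(extract_body(root))
```

Matching on the class and itemprop attributes, rather than on the element's position in the page, is what lets the same XPath work for every article opened in the loop.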
Step 7. Check the workflow.
Now check the workflow by clicking each action in order, starting from the beginning, and make sure that the AJAX content is scraped from the pages.
Go to Web Page ➜ The Cycle Pages box ➜ Click to paginate ➜ The Loop Item box ➜ Click Item ➜ Extract Data.
Step 8. Click "Save" to save your configuration. Then click "Next" ➜ Click "Next" ➜ Click "Local Extraction" to run the task on your computer. Octoparse will automatically extract all the data selected.
Step 9. The extracted data will be shown in the "Data Extracted" pane. Click the "Export" button to export the results to an Excel file, a database or another format, and save the file to your computer.
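If you want to picture how the extracted rows look as a flat table, the Python sketch below writes a list of records to CSV text. It is a minimal illustration, not Octoparse's export code: the column names in FIELDS and the function rows_to_csv are assumptions, and you would substitute the field names you set in Step 5.

```python
import csv
import io

# Assumed column names; use the field names you set in Step 5.
FIELDS = ["Title", "BodyText", "PublishedDate", "Author"]


def rows_to_csv(rows):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        # Missing fields become empty cells instead of raising an error.
        writer.writerow({name: row.get(name, "") for name in FIELDS})
    return buffer.getvalue()
```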
Part 2. Schedule a task and run it on Octoparse's cloud platform.
Once the scraping task works as described in the steps above, you can schedule it to run on Octoparse's cloud platform.
Step 1. Find the task you've just made ➜ Double-click the task to open it ➜ Keep clicking "Next" until you reach the "Done" step ➜ Select the option "Schedule Cloud Extraction Settings" to begin the scheduling process.
Step 2. Set the parameters.
In the "Schedule Cloud Extraction Settings" dialog box, you can select the Periods of Availability for the task and the Run Mode, which determines how often the task runs to collect data.
· Periods of Availability - The data extraction period, defined by a Start date and an End date.
· Run Mode - Once, Weekly, Monthly, Real Time
Set a suitable time interval for collecting the news articles and click "Start" to schedule your task. After you click "OK" in the Cloud Extraction Scheduled window, the task will be added to the waiting queue and you can check the status of the task.
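The two settings work together: the period bounds when the task may run, and the run mode decides how often it runs inside that period. The Python sketch below models this under stated assumptions. Schedule and run_dates are hypothetical names, not part of any Octoparse API, and only the "once" and "weekly" modes are covered to keep the example short.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Schedule:
    """Hypothetical model of the dialog's settings, not an Octoparse API."""
    start: date
    end: date
    run_mode: str  # "once" or "weekly" in this sketch


def run_dates(schedule):
    """Return the dates on which the task would run within the period."""
    if schedule.end < schedule.start:
        raise ValueError("End date must not precede start date")
    if schedule.run_mode == "once":
        return [schedule.start]
    if schedule.run_mode == "weekly":
        dates = []
        current = schedule.start
        while current <= schedule.end:
            dates.append(current)
            current += timedelta(weeks=1)
        return dates
    raise ValueError(f"Unsupported run mode: {schedule.run_mode}")
```

For example, a weekly schedule from January 10, 2017 to January 31, 2017 yields four runs in this model, one on each Tuesday in that range.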
Author: The Octoparse Team