Scrape AJAX Pages from The Washington Post
Monday, January 9, 2017 9:00 PM
Octoparse can scrape AJAX-driven websites, that is, pages that load additional content without a full page reload.
In this web scraping tutorial, we will show you how to scrape an AJAX-driven website with Octoparse. We will scrape the business news articles on washingtonpost.com and extract the content of each article: the title, the body text, the published date and the author.
Getting real-time data in Octoparse takes two parts: making a scraping task, and scheduling the task on Octoparse's cloud platform.
The website URL we will use is https://www.washingtonpost.com/business/?nid=top_nav_business&utm_term=.52aa240ffcf3.
The data fields include the article title, the body text of the article, the published date and the author.
You can download the task (the .otd file) directly and begin collecting the data, or you can follow the steps below to make a scraping task that scrapes the business news articles from washingtonpost.com. (The extraction task for this tutorial is available for download in case you need it.)
Part 1. Make a scraping task in Octoparse
Step 1. Set up basic information.
Click "Quick Start" ➜ Choose "New Task (Advanced Mode)" ➜ Complete basic information ➜ Click "Next".
Step 2. Enter the target URL in the built-in browser ➜ Click the "Go" icon to open the webpage.
(URL of the example: https://www.washingtonpost.com/business/?nid=top_nav_business&utm_term=.52aa240ffcf3)
Step 3. Click the "Load More" button to load more articles
The page shows only part of the article list at first, so we need to click the "Load More" button to reveal more news articles.
Click the “Load More” button ➜ Select "Click an item" and a "Click Item" action will be created in the workflow ➜ Select the “Customize Field” button ➜ Choose “Define ways to locate an item” ➜ Copy the XPath of the button ➜ Click "Cancel" ➜ Click "Cancel".
Drag a "Loop" item into the workflow, under the "Go To Web Page" action. ➜ Choose a "Loop Mode" under "Advanced Options". ➜ Select "Single Element" option ➜ Paste the XPath of the button ➜ Click "Save".
Drag the "Click Item" action into the Loop ➜ Navigate to the "Click Item" action ➜ Tick "Click items in the Loop Item (box)" under "Advanced Options" ➜ Click "Save".
Because the web page uses AJAX to load more news articles, we need to set an AJAX timeout for the action.
Navigate to the "Click Item" action ➜ Tick the "AJAX Load" checkbox ➜ Set an AJAX timeout of 5 seconds ➜ Click "Save".
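To make the idea behind the AJAX timeout concrete, here is a minimal Python sketch of waiting for new content after a click. It is not how Octoparse is implemented. The function wait_for_more_items and the FakePage class are hypothetical names made up for this illustration, and FakePage only simulates a page whose new articles arrive shortly after a click.

```python
import time


def wait_for_more_items(count_items, previous_count, timeout=5.0, poll_interval=0.05):
    """Poll count_items() until it exceeds previous_count or the timeout elapses.

    Returns the last observed item count.
    """
    deadline = time.monotonic() + timeout
    current = count_items()
    while current <= previous_count and time.monotonic() < deadline:
        time.sleep(poll_interval)
        current = count_items()
    return current


class FakePage:
    """Hypothetical stand-in for a browser page (assumption, not a real API).

    After click_load_more(), five new articles "arrive" once `delay` seconds pass.
    """

    def __init__(self, delay=0.2):
        self.items = 5
        self._delay = delay
        self._ready_at = None

    def click_load_more(self):
        # Simulate the AJAX request being sent; content arrives later.
        self._ready_at = time.monotonic() + self._delay

    def count_items(self):
        if self._ready_at is not None and time.monotonic() >= self._ready_at:
            self.items += 5
            self._ready_at = None
        return self.items


page = FakePage()
before = page.count_items()
page.click_load_more()
# Wait up to 5 seconds, mirroring the AJAX timeout chosen in this step.
after = wait_for_more_items(page.count_items, before, timeout=5.0)
print(before, after)
```

If the timeout is too short for the site, the wait ends before the new articles appear and the later steps see an incomplete list, which is why a few seconds of margin is a reasonable choice here.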
Step 4. Move your cursor over the articles with a similar layout, from which you will extract the content.
Click the Loop Item box ➜ Click the first article ➜ Click "Create a list of items" (sections with similar layout) ➜ Click "Add current item to the list".
The first article has now been added to the list ➜ Click "Continue to edit the list".
Click the second article ➜ Click "Add current item to the list" again (now we get all the articles with a similar layout) ➜ Click "Finish Creating List" ➜ Click "loop" to process the list for extracting the content of the articles.
Next, we modify the XPath of the Loop Item box to make sure we can extract all the articles from the web page.
Click the second Loop Item box ➜ Choose a "Loop Mode" under "Advanced Options". ➜ Select "Variable list" option ➜ Enter the XPath to correctly extract all the articles from the page ➜ Click "Save".
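As an illustration of how a single XPath can match every article in a list, here is a small Python sketch using the standard library. The markup and the class names (story-list, story, ad) are invented for this example and are not the actual structure of washingtonpost.com. Note also that xml.etree.ElementTree supports only a subset of XPath and needs a leading ".", so the expression differs slightly from what you would paste into Octoparse.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page fragment (assumption, not the real site markup).
SAMPLE = """
<html><body>
  <div class="story-list">
    <div class="story"><h3><a href="/a1">First headline</a></h3></div>
    <div class="story"><h3><a href="/a2">Second headline</a></h3></div>
    <div class="ad">Sponsored</div>
    <div class="story"><h3><a href="/a3">Third headline</a></h3></div>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# One expression selects every article block and skips the unrelated "ad" block.
stories = root.findall(".//div[@class='story']")

# Inside each matched block, a relative path reaches the headline link.
titles = [story.find("./h3/a").text for story in stories]
print(titles)
```

The point of the "Variable list" option is the same: the XPath should describe what all the article blocks have in common, rather than the position of one particular article.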
Step 5. Extract the content of the article.
Right click the title of the article ➜ Select "Extract text". Other contents can be extracted in the same way.
All the extracted content will be listed in Data Fields ➜ Click the "Field Name" to modify it. Then click "Save".
Note: Right click the content to avoid triggering its hyperlink, if necessary.
Step 6. Check the workflow.
Now we need to check the workflow by clicking actions from the beginning of the workflow.
Go to Web Page ➜ The first Loop Item box (tick the "End loop when" checkbox and set the loop to stop after executing the action 3 times) ➜ Click Item ➜ The second Loop Item box ➜ Click Item ➜ Extract Data.
Step 7. Click "Save" to save your configuration. Then click "Next" ➜ Click "Next" ➜ Click "Local Extraction" to run the task on your computer. Octoparse will automatically extract all the data selected.
Step 8. The extracted data will be shown in the "Data Extracted" pane. Click the "Export" button to export the results to an Excel file, a database or another format, and save the file to your computer.
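If you want to post-process the results outside Octoparse, the four data fields map naturally onto CSV columns. The sketch below shows one way to write such rows with Python's csv module. The column names and the sample row are placeholders invented for this example, not real extracted data or Octoparse's actual export layout.

```python
import csv
import io

# Column names are an assumption based on the four data fields in this tutorial.
FIELDS = ["title", "body", "published_date", "author"]


def export_articles(rows, fh):
    """Write article dicts to an open text file handle as CSV with a header row."""
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Placeholder row for illustration only.
rows = [
    {
        "title": "Example headline",
        "body": "Example body text, with a comma.",
        "published_date": "2017-01-09",
        "author": "Example Author",
    }
]

buffer = io.StringIO()
export_articles(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead of memory, pass a file opened with open("articles.csv", "w", newline="", encoding="utf-8"); the newline="" argument is what the csv module documentation recommends to avoid blank lines on Windows.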
Part 2. Schedule a task and run it on Octoparse's cloud platform.
Once you have made the scraping task by following the steps above, you can schedule it to run on Octoparse's cloud platform.
Step 1. Find the task you've just made ➜ Double click the task to open it ➜ Keep clicking "Next" until you are in the "Done" step ➜ Select the option “Schedule Cloud Extraction Settings” to begin the scheduling process.
Step 2. Set the parameters.
In the “Schedule Cloud Extraction Settings” dialog box, you can select the Periods of Availability for your task and the Run Mode, which controls how often the task runs to collect data.
· Periods of Availability - The data extraction period, set with the Start date and End date.
· Run Mode - Once, Weekly, Monthly, Real Time
Set a suitable time interval to collect the articles and click "Start" to schedule your task. After you click "OK" in the Cloud Extraction Scheduled window, the task will be added to the waiting queue and you can check the status of the task.
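To see how a Run Mode combines with the Periods of Availability, the following Python sketch lists the runs a weekly schedule would produce between a start date and an end date. It is only an illustration of the arithmetic. The function weekly_runs and its parameters are hypothetical and do not correspond to any Octoparse API.

```python
from datetime import date, datetime, time, timedelta


def weekly_runs(start, end, weekday, at):
    """Return every run datetime falling on `weekday` (Monday=0) from start to end inclusive."""
    # Move forward from the start date to the first matching weekday.
    days_ahead = (weekday - start.weekday()) % 7
    current = start + timedelta(days=days_ahead)
    runs = []
    while current <= end:
        runs.append(datetime.combine(current, at))
        current += timedelta(days=7)
    return runs


# Example period: 9 to 31 January 2017, running every Monday at 09:00.
runs = weekly_runs(date(2017, 1, 9), date(2017, 1, 31), weekday=0, at=time(9, 0))
for run in runs:
    print(run.isoformat())
```

With this period the schedule yields four runs (9, 16, 23 and 30 January); a run that would fall after the End date is simply not scheduled.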
Author: The Octoparse Team