Web Scraping Case Study | Scraping Articles from News24
Thursday, March 30, 2017 8:15 AM
In this tutorial, I will use News24.com as an example to show how to create a complete loop list that includes every item we want to scrape, with none missing, like the example in the figure below. To do this, we modify the XPath of the loop items so that it matches all of the items we want to capture.
List features covered
Some features that we will touch upon include:
Now, let’s get started!
Download my extraction task of this tutorial HERE
Step 1. Start your task and set up basic information.
Step 2. Navigate to your target webpage
Step 3. Load content on the web page and set AJAX Timeout
First of all, we need to make sure that all the items we want to scrape are displayed on the web page after clicking the "LOAD MORE ARTICLES" button repeatedly.
Note that if a page takes a long time to load after "Click to Paginate", you can set a longer waiting time before the next action executes.
This web page uses AJAX to load more news articles, so we can set an AJAX Timeout for the action.
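To make the idea of an AJAX timeout concrete, here is a minimal Python sketch of "click, then wait a bounded time for new items". It only illustrates the concept and is not how Octoparse works internally. `FakePage`, `click_load_more`, `item_count`, `load_all`, and the timing values are hypothetical names and numbers made up for this example.

```python
import time


class FakePage:
    """Hypothetical stand-in for a page with a 'LOAD MORE ARTICLES' button."""

    def __init__(self, batches):
        self._batches = list(batches)  # items revealed by each successive click
        self._items = 10               # items visible before any click

    def click_load_more(self):
        # Each click reveals the next batch, if any is left.
        if self._batches:
            self._items += self._batches.pop(0)

    def item_count(self):
        return self._items


def load_all(page, ajax_timeout=0.2, poll_interval=0.01, max_clicks=50):
    """Click 'load more' until a click adds nothing within ajax_timeout seconds."""
    clicks = 0
    while clicks < max_clicks:
        before = page.item_count()
        page.click_load_more()
        clicks += 1
        deadline = time.monotonic() + ajax_timeout
        # Poll until new items appear or the timeout expires.
        while page.item_count() == before and time.monotonic() < deadline:
            time.sleep(poll_interval)
        if page.item_count() == before:
            break  # nothing new arrived in time: assume everything is loaded
    return page.item_count(), clicks


page = FakePage(batches=[10, 10, 5])
total, clicks = load_all(page)
print(total, clicks)
```

A timeout that is too short stops the loop before slow content arrives, while one that is too long only makes the final, empty click slower. That is why a slow page calls for a longer waiting time.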
Step 4. Create a list of items
The web elements on this page are arranged in a list format, so we can start by creating a loop list.
(Now the first item has been added to the list. We still need to add the remaining items.)
After the loop list is created, we can see that the article items placed after the 5 side-by-side category-level buttons have not been added to the loop yet.
Step 5. Modify the XPath of loop items
From Step 4, we can see that the list of articles is interrupted by a different web element (the category-level buttons). Thus, we should modify the XPath of the loop items so that it locates all the article elements, by following the steps below.
Now, we can see all needed items are added to the loop list.
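The effect of broadening an XPath can be sketched in a few lines of Python. The markup below is a hypothetical, simplified page, not the real News24 HTML, and the class names (`col`, `article`, `category-buttons`, `wrapper`) are assumptions chosen for illustration. The sketch uses the limited XPath subset in Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: a row of category buttons sits between
# two groups of articles, and the second group is nested one level deeper.
PAGE = """
<div id="feed">
  <div class="col">
    <div class="article"><a href="/news/1">Story 1</a></div>
    <div class="article"><a href="/news/2">Story 2</a></div>
  </div>
  <div class="category-buttons">
    <a href="/sport">Sport</a>
    <a href="/business">Business</a>
  </div>
  <div class="wrapper">
    <div class="col">
      <div class="article"><a href="/news/3">Story 3</a></div>
      <div class="article"><a href="/news/4">Story 4</a></div>
    </div>
  </div>
</div>
"""

root = ET.fromstring(PAGE)

# Position-bound path: only matches articles in columns directly under the feed.
narrow = root.findall("./div[@class='col']/div[@class='article']")

# Broader path: matches every article element at any depth.
broad = root.findall(".//div[@class='article']")

narrow_titles = [item.findtext("a") for item in narrow]
broad_titles = [item.findtext("a") for item in broad]
print(narrow_titles)
print(broad_titles)
```

The narrow path stops at the first group because it depends on where the articles sit in the page. The broader path selects by what the elements are, so the intervening buttons no longer cut the list short.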
Step 6. Set page scrolling times
Sometimes the site continues to load more items as you scroll down to the bottom, before the "Load more" button appears. We can set the number of scrolls and the interval between them so that the extraction runs smoothly. In this scraping rule, I scroll down 5 times to display more items.
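As a rough illustration of "scroll N times with a pause in between", the Python sketch below simulates the behavior. It does not drive a real browser. `scroll_to_load`, `fake_scroll`, the starting count of 10 items, the 4 items per scroll, and the cap of 25 are all hypothetical.

```python
import time


def scroll_to_load(scroll_once, times=5, interval=0.01):
    """Call scroll_once() `times` times, pausing `interval` seconds between scrolls."""
    counts = []
    for i in range(times):
        counts.append(scroll_once())
        if i < times - 1:
            time.sleep(interval)  # give the page time to render new items
    return counts


# Hypothetical page: starts with 10 items, each scroll reveals 4 more, capped at 25.
state = {"items": 10}


def fake_scroll():
    state["items"] = min(state["items"] + 4, 25)
    return state["items"]


history = scroll_to_load(fake_scroll, times=5, interval=0.01)
print(history)
```

In this simulation the count stops growing on the last scroll, which suggests that 5 scrolls are enough for a page of this size. On a real site the right number of scrolls and the interval have to be found by trial.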
Step 7. Select the data to be extracted and modify the data fields
Then, all the content will be selected in Data Fields.
Next, we can rename these selected data fields if necessary.
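Renaming data fields amounts to mapping default column names to meaningful ones. The Python sketch below shows that idea on made-up rows. The default names (`Field1`, `Field2`, `Field3`), the new names, and the sample values are assumptions for illustration, not data extracted from News24.

```python
def rename_fields(rows, mapping):
    """Return copies of rows with keys renamed per mapping; other keys are kept."""
    return [{mapping.get(key, key): value for key, value in row.items()} for row in rows]


# Hypothetical extracted rows with generic default field names.
rows = [
    {"Field1": "Story 1", "Field2": "/news/1", "Field3": "2017-03-30"},
    {"Field1": "Story 2", "Field2": "/news/2", "Field3": "2017-03-30"},
]

# Hypothetical mapping from default names to descriptive ones.
mapping = {"Field1": "title", "Field2": "url"}

renamed = rename_fields(rows, mapping)
print(renamed[0])
```

Fields that are missing from the mapping (here `Field3`) keep their original names, so a partial rename is safe.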
Step 8. Start running your task
Octoparse will automatically extract all the data selected. Check the "Data Extracted" pane for the extraction progress.
Step 9. Check the data and export
(The data extracted with pagination will be shown in the "Data Extracted" pane.)
Now you've learned how to scrape data from News24.com.
Author: The Octoparse Team