
Web Scraping Case Study | Scraping Articles from News24

Thursday, March 30, 2017 8:15 AM

Brief Intro

In this tutorial, I will use News24.com as an example to show you how to create a complete loop list that includes all the items we want to scrape, without missing any, as in the example figure shown below. When some items are missed, we can modify the XPath of the loop variables so that it covers all the items we want to capture.

 

 

 

Features covered

Some features that we will touch upon include:

  • Configuring the "Load More" button
  • Setting page scrolling
  • Setting an AJAX timeout
  • Modifying XPath

 

Now, let’s get started!                      

Download my extraction task for this tutorial HERE

 

Step 1. Start your task and set up basic information.

  • Click “Quick Start”
  • Choose "New Task (Advanced Mode)"
  • Complete the “Basic Information”
  • Click “Next”

 

Step 2. Navigate to your target webpage

  • Click the “Go” icon to open the webpage

                                                                                                                                                                                                                                                                                                                              

Step 3. Load content on the web page and set AJAX Timeout

First of all, we need to make sure that all the items we want to scrape are displayed on the page, which requires clicking the "LOAD MORE ARTICLES" button repeatedly.

  • Click "LOAD MORE ARTICLES"
  • Select "Loop click the element "

Note that if the page takes a long time to load after "Click to Paginate", you can set a longer waiting time before Octoparse executes the next action.

This web page uses AJAX to load more news articles, so we need to set an AJAX timeout for the "Click to Paginate" action (a rough code sketch of what this loop click and timeout automate follows the list below).

  • Navigate to "Click to Paginate" action
  • Tick "AJAX Load" checkbox
  • Set an AJAX timeout of 2 seconds
  • Click "Save"

 

 

Step 4. Create a list of items

Looking at the webpage, the article elements are arranged in a list format, so we can start by creating a loop list.

  • Move your cursor over an article section with the layout you want to extract from, and click anywhere on the first section on the web page. Make sure the outlined box contains the data to be extracted.
  • When prompted, Click “Create a list of items” 
  • Click “Add current item to the list”

         (Now the first item has been added to the list; next, we need to add the remaining items.)

  • Click “Continue to edit the list”
  • Click a second section with similar layout
  • Click “Add current item to the list” again
  • Click “Finish Creating List”
  • Click "Loop". This action tells Octoparse to click on each section in the list and extract the selected data

After creating the loop list, we can see that the article items that appear after the five side-by-side category buttons have not been added to the loop yet.

 

 

Step 5: Modify the XPath of loop items

As observed in Step 4, the list of articles is interrupted by a different web element (the category buttons). Thus, we need to modify the XPath of the loop items so that it locates all of the article elements, following the steps below.

  • Go to "Loop mode" 
  • Select "Variable list" 
  • Modify the XPath of the Variable list to: //div[@class='synopsis_head']/h5[1]/a[1]
  • Click "Save"

Now we can see that all the needed items have been added to the loop list. If you are curious what this XPath actually matches, a short code sketch follows below.
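
If you want to double-check the XPath outside Octoparse, a quick way is to run the same expression with Python's lxml against a saved copy of the page. This is only an illustration; the file name news24.html is an assumed local copy of the fully loaded page (for example, the page source saved in the Step 3 sketch):

# Illustration only: run the modified XPath against saved page HTML with lxml.
from lxml import html as lxml_html

with open("news24.html", "r", encoding="utf-8") as f:   # assumed local copy of the page
    tree = lxml_html.fromstring(f.read())

# The same expression entered in the Octoparse "Variable list" box:
links = tree.xpath("//div[@class='synopsis_head']/h5[1]/a[1]")

print(len(links), "articles matched")
for a in links:
    print(a.text_content().strip(), "->", a.get("href"))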

 

 

Step 6. Set page scrolling times

Some sites keep loading more items as you scroll down to the bottom, before the "Load more" button appears. In that case we can set the number of scrolls and the interval between them so the extraction runs smoothly. In this scraping rule, I'd like to scroll down 5 times to display more items (see the scrolling sketch after the list below).

  • Navigate to "Cycle Pages" action
  • Go to "End loop when" and expand its selection box
  • Select "5" times
  • Click "Save"

 

 

Step 7. Select the data to be extracted and modify the data fields

  • Select the data to be extracted
  • Right click on the title of the article
  • Select “Extract text”
  • Follow the same steps to extract the other data fields

Then all the selected content will appear in the Data Fields pane.

Next, we can rename these data fields if necessary (a code sketch of the equivalent field extraction follows below).

  • Click "Save"                                   

 

 

Step 8. Start running your task

  • After saving your extraction configuration, click “Next”
  • Select “Local Extraction”
  • Click “OK” to run the task on your computer

Octoparse will automatically extract all the data selected. Check the "Data Extracted" pane for the extraction progress.

 

 

Step 9. Check the data and export

  • Check the data extracted   

       (The data extracted, including data from the paginated pages, will be shown in the “Data Extracted” pane.)

  • Click "Export" button to export the results to Excel file, databases or other formats and save the file to your computer

                                                                                                                          

 

 

Now you've learned how to scrape data from News24.com. 

Now check out similar case studies:

More tutorials and blog posts are available if you'd like to learn more about related topics:

 

 

 

Author: The Octoparse Team

 

 

 

Download Octoparse Today

 

 

For more information about Octoparse, please click here.

Sign up today!

 

 

Author's Picks

 

Octoparse Smart Mode -- Get Data in Seconds

Get Started with Octoparse in 2 Minutes

Pagination Scraping: Configure “Loop click next page” When It Can’t Be Detected

Scrape Data from Website with Pagination - Infinite Scrolling

Collect Data from eBay

Top 30 Free Web Scraping Software

Collect Data from Amazon
