Web Scraping Case Study | Scraping Articles from News24
Thursday, March 30, 2017 8:15 AM
In this tutorial, I will use News24.com as an example to show you how to create a complete loop list that includes every item we want to scrape, with none missing. To do this, we modify the XPath of the loop's variable list so that it matches all of the items we want to capture.
Features covered
Some features that we will touch upon include:
- Configuration of "Load More" button
- Setting Scrolling Screen
- AJAX Timeout
- Modify XPath
Now, let’s get started!
Download my extraction task of this tutorial HERE
Step 1. Start your task and set up basic information.
- Click “Quick Start”
- Choose "New Task (Advanced Mode)"
- Complete the “Basic Information”
- Click “Next”
Step 2. Navigate to your target webpage
- Enter the target URL in the built-in browser (URL for this example: http://www.fin24.com/budget)
- Click “Go” icon to open the webpage
Step 3. Load content on the web page and set AJAX Timeout
First, we need to make sure that all the items we want to scrape are displayed on the page, which means clicking the "LOAD MORE ARTICLES" button repeatedly.
- Click "LOAD MORE ARTICLES"
- Select "Loop click the element"
Note that if the page takes a long time to load after the "Click to Paginate" action, you can set a longer wait time before the next action executes.
This page uses AJAX to load more news articles, so we can set an AJAX timeout for the action.
- Navigate to "Click to Paginate" action
- Tick "AJAX Load" checkbox
- Set an AJAX timeout of 2 seconds
- Click "Save"
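The idea behind looping a "load more" click with an AJAX timeout can be sketched in plain Python. This is only an illustration of the logic, not how Octoparse implements it: the function `click_until_exhausted`, its parameters, and the `FakePage` stand-in are all hypothetical names introduced here. After each click, the loop waits up to the timeout for new items to appear and stops when nothing new arrives.

```python
import time


def click_until_exhausted(click_more, count_items, ajax_timeout=2.0,
                          poll=0.1, max_clicks=50):
    """Click a 'load more' control until no new items show up.

    click_more  -- callable that clicks the button; returns False if it is gone
    count_items -- callable that returns how many items are currently displayed
    ajax_timeout -- seconds to wait for new items after each click (assumed 2s,
                    matching the value used in this tutorial)
    """
    clicks = 0
    while clicks < max_clicks:
        before = count_items()
        if not click_more():          # button missing: everything is loaded
            break
        clicks += 1
        deadline = time.monotonic() + ajax_timeout
        # Poll until the AJAX response adds items or the timeout expires.
        while count_items() <= before and time.monotonic() < deadline:
            time.sleep(poll)
        if count_items() <= before:   # timed out with nothing new
            break
    return count_items()


class FakePage:
    """Hypothetical stand-in for a page that loads items in batches."""

    def __init__(self, batches):
        self.batches = list(batches)
        self.items = 0

    def click_more(self):
        if not self.batches:
            return False
        self.items += self.batches.pop(0)
        return True

    def count(self):
        return self.items


page = FakePage([10, 10, 5])
print(click_until_exhausted(page.click_more, page.count))  # prints 25
```

A timeout that is too short can end the loop before a slow response arrives, which is why a longer wait is worth setting on slow pages.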
Step 4. Create a list of items
The web elements on this page are arranged as a list, so we can start by creating a loop list.
- Move your cursor over an article block with the layout you want to extract, then click anywhere in the first section on the web page. Make sure the outlined box contains the data to be extracted.
- When prompted, click “Create a list of items”
- Click “Add current item to the list”
(Now the first item has been added to the list; we still need to add the remaining items.)
- Click “Continue to edit the list”
- Click a second section with similar layout
- Click “Add current item to the list” again
- Click “Finish Creating List”
- Click “Loop”. This tells Octoparse to click on each section in the list to extract the selected data
After creating the loop list, we can observe that the article items that come after the 5 side-by-side category-level buttons haven't been added to the loop yet.
Step 5. Modify the XPath of loop items
As seen at the end of Step 4, the list of articles is interrupted by a different web element (the row of category buttons). We should therefore modify the XPath of the loop items so that it locates all the article elements, by following the steps below.
- Go to "Loop mode"
- Select "Variable list"
- Change the XPath of the Variable list to: //div[@class='synopsis_head']/h5/a
- Click "Save"
Now, we can see all needed items are added to the loop list.
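To see why a class-based XPath picks up items on both sides of an interrupting element, here is a small sketch using Python's standard library. The markup below is hypothetical and heavily simplified (the real page structure is not reproduced here); only the XPath itself comes from this tutorial. Note that ElementTree supports a limited XPath subset and needs a leading "." on the expression.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: three articles with a row of category buttons
# sitting between the second and third article.
PAGE = """
<div id="articles">
  <div class="synopsis_head"><h5><a href="/budget/a1">Article 1</a></h5></div>
  <div class="synopsis_head"><h5><a href="/budget/a2">Article 2</a></h5></div>
  <div class="category_buttons"><a href="/tax">Tax</a><a href="/economy">Economy</a></div>
  <div class="synopsis_head"><h5><a href="/budget/a3">Article 3</a></h5></div>
</div>
"""

root = ET.fromstring(PAGE)

# Same expression as in the step above, written relative to the root element.
links = root.findall(".//div[@class='synopsis_head']/h5/a")
titles = [link.text for link in links]

print(titles)  # ['Article 1', 'Article 2', 'Article 3']
```

Because the expression selects by class rather than by position, the category buttons are skipped and the article after them is still matched.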
Step 6. Set page scrolling times
Sometimes a site keeps loading more items as you scroll down to the bottom, before the "Load more" button appears. We can set the number of scrolls and the interval between them so that the extraction runs smoothly. In this scraping rule, I'd like to scroll down 5 times to display more items.
- Navigate to "Cycle Pages" action
- Go to "End loop when" and expand its selection box
- Select "5" times
- Click "Save"
Step 7. Select the data to be extracted and modify the data fields
- Select the data to be extracted
- Right click on the title of the article
- Select “Extract text”
- Follow the same steps to extract the other data fields
Then, all the content will be selected in Data Fields.
Next, we can rename these selected data fields if necessary.
- Click "Save"
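Conceptually, this step turns each loop item into a row of named fields. The sketch below shows that idea in Python with the standard library; it is not Octoparse's own mechanism. The function `extract_rows`, the default names `Field1` and `Field2`, and the sample markup are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# XPath of the loop items from Step 5, written relative to the root element.
ITEM_XPATH = ".//div[@class='synopsis_head']/h5/a"


def extract_rows(markup, rename=None):
    """Extract one row per loop item and optionally rename the fields."""
    rename = rename or {}
    root = ET.fromstring(markup)
    rows = []
    for link in root.findall(ITEM_XPATH):
        raw = {
            "Field1": (link.text or "").strip(),  # "Extract text"
            "Field2": link.get("href", ""),       # the link's URL
        }
        # Apply the new names; fields without a mapping keep their default.
        rows.append({rename.get(name, name): value for name, value in raw.items()})
    return rows


# Hypothetical, simplified markup for two articles.
SAMPLE = """
<div>
  <div class="synopsis_head"><h5><a href="/budget/a1"> Article 1 </a></h5></div>
  <div class="synopsis_head"><h5><a href="/budget/a2">Article 2</a></h5></div>
</div>
"""

rows = extract_rows(SAMPLE, rename={"Field1": "title", "Field2": "url"})
print(rows[0])  # {'title': 'Article 1', 'url': '/budget/a1'}
```

Giving fields descriptive names at this stage keeps the exported columns readable later.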
Step 8. Start running your task
- After saving your extraction configuration, click “Next”
- Select “Local Extraction”
- Click “OK” to run the task on your computer
Octoparse will automatically extract all the data selected. Check the "Data Extracted" pane for the extraction progress.
Step 9. Check the data and export
- Check the data extracted
(The data extracted with pagination will be shown in the “Data Extracted” pane.)
- Click the "Export" button to export the results to an Excel file, a database, or another format, and save the file to your computer
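If you later want to post-process scraped rows yourself, writing them to CSV (which Excel can open) takes only a few lines of Python. This is a generic sketch and not the Octoparse export feature; the helper `rows_to_csv` and the sample rows are hypothetical.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dict rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical rows standing in for the extracted data.
sample_rows = [
    {"title": "Budget, in brief", "url": "/budget/a1"},
    {"title": "Article 2", "url": "/budget/a2"},
]

csv_text = rows_to_csv(sample_rows, ["title", "url"])
print(csv_text)

# To save to disk instead, write csv_text to a file opened with
# open("articles.csv", "w", newline="", encoding="utf-8").
```

The csv module quotes values that contain commas, so titles such as "Budget, in brief" stay in a single column.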
Now you've learned how to scrape data from News24.com.
Now check similar case studies:
- Scrape Data from Multiple Web Pages (Example: Medline)
- Scrape AJAX Pages from USA TODAY
- Scrape Articles from CNN Money
More tutorials or blogs are available if you'd like to learn more about related topics:
- How to Budget Smarter with Big Data
- Reasons and Solutions - Missing Data in Cloud Extraction
- How to scrape detail page data with pagination?
Author: The Octoparse Team