Web Scraping Case Study | Security System News

Saturday, April 1, 2017 9:53 AM

In this tutorial, I will walk through the detailed steps to scrape news data from securitysystemsnews.com.

One key feature covered in this tutorial is pagination with a "Next" button. Check out the related tutorial "Scraping from multi-pages: pagination with 'Next' button" for the explicit steps to perform pagination.

(Download the extraction task of this tutorial HERE)


Some features that we will touch upon include:

  • Pagination
  • Building a list
  • Data Extraction


Now, let's get started!

Step 1. Start your task and set up basic information.

  • Click “Quick Start”
  • Choose "New Task (Advanced Mode)"
  • Complete the “Basic Information”
  • Click “Next”



Step 2. Navigate to your target webpage

  • Enter the target URL in the built-in browser
  • Click the "Go" icon to open the webpage


Step 3. Set up pagination  

To extract data from websites that spread their content across multiple pages, you need to add a page navigation action for pagination.

  • Click "Next" to the right of the page numbers
  • Choose "Loop Click Next Page". This tells Octoparse to click open each page for further extraction actions.
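Behind the scenes, this kind of "Next"-button pagination boils down to a simple loop: fetch a page, collect its items, and follow the next-page link until none is left. The sketch below illustrates the pattern in plain Python; the `fetch_page()` function and the URLs are stand-ins for real HTTP requests, not Octoparse's actual implementation.

```python
# Sketch of "Next"-button pagination: keep following the next-page
# link until there is none. fetch_page() is a stand-in for a real
# HTTP request; the page data below is made up for illustration.
pages = {
    "/news?page=1": {"articles": ["a1", "a2"], "next": "/news?page=2"},
    "/news?page=2": {"articles": ["a3"], "next": "/news?page=3"},
    "/news?page=3": {"articles": ["a4"], "next": None},  # last page
}

def fetch_page(url):
    return pages[url]

def scrape_all(start_url):
    collected, url = [], start_url
    while url is not None:             # stop when there is no "Next" button
        page = fetch_page(url)
        collected.extend(page["articles"])
        url = page["next"]             # follow the "Next" link
    return collected

print(scrape_all("/news?page=1"))      # all articles across every page
```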



Step 4. Create a list of items

Looking at the webpage, the articles we want are arranged in a list format, so we'll need to specify a list first.

  • Move the cursor over an article, make sure the outlined box contains the data you want to extract, then click on it.
  • When prompted, click “Create a list of items”
  • Click “Add current item to the list”

Now, see how the first item has been added to the list. Next, we need to add all the remaining articles to the list.

  • Click “Continue to edit the list”
  • Click a second article with a similar layout
  • Click “Add current item to the list” again

With that, all the articles are added to the list.

  • Click “Finish Creating List”
  • Click "Loop". This action tells Octoparse to click each item in the list to extract the selected data.
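Conceptually, "creating a list of items" means identifying the repeated markup that every article shares. A minimal Python sketch of that idea, using only the standard library's `html.parser`, is shown below; the HTML snippet and its class names are hypothetical, not taken from securitysystemsnews.com.

```python
from html.parser import HTMLParser

# Hypothetical markup: real article listings will differ, but the idea
# is the same — repeated elements form the "list of items".
HTML = """
<div class="article"><h2>First headline</h2></div>
<div class="article"><h2>Second headline</h2></div>
<div class="article"><h2>Third headline</h2></div>
"""

class ArticleLister(HTMLParser):
    """Collects the text of every <h2>, one per repeated article block."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())

parser = ArticleLister()
parser.feed(HTML)
print(parser.titles)  # one entry per article in the list
```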



Step 5. Select the data to be extracted

Once the list has been built, we can proceed to extract the data we want.

  • Right-click on the title of the first article
  • Select “Extract text”
  • Follow the same steps to extract the other data fields, e.g., author and sub-title.
  • Edit the field names if necessary. 
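Extracting several named fields from each item is the same idea applied per field: locate the element that holds the value and take its text. The sketch below shows one way to do this in plain Python with a regular expression; the item markup, class names, and sample values are all invented for illustration.

```python
import re

# Hypothetical item markup; the classes and values are illustrative only.
ITEM = """
<div class="article">
  <h2 class="title">New camera line announced</h2>
  <span class="author">Jane Doe</span>
  <p class="subtitle">Vendor expands its product range</p>
</div>
"""

def extract_field(html, css_class):
    # Grab the text inside the first tag carrying the given class.
    m = re.search(r'class="%s"[^>]*>([^<]+)<' % css_class, html)
    return m.group(1).strip() if m else None

# One record per article, one key per data field.
record = {name: extract_field(ITEM, name)
          for name in ("title", "author", "subtitle")}
print(record)
```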



Step 6. Adjust the relative loop sequence

After we created the list of items in Step 4, the "Loop Item" sits outside of the "Cycle Pages" action. This order is wrong: we need to scrape all of the articles on the current page first, and only then click to paginate. Thus, we need to adjust the relative nesting order manually by following the steps below.

  • Drag the second “Loop Item” box and position it before the “Click to paginate” action of the “Cycle Pages” loop in the Workflow Designer. 
  • Click "Save"
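In plain-code terms, the fix in this step amounts to moving the item loop inside the page loop, so every article on a page is processed before "Next" is clicked. The sketch below uses made-up page data to show the correct nesting:

```python
# Correct nesting: the item loop runs inside the page loop, so all
# articles on a page are scraped before pagination. Data is illustrative.
pages = [["a1", "a2"], ["a3", "a4"], ["a5"]]

collected = []
for page in pages:           # outer loop: "Cycle Pages"
    for article in page:     # inner loop: "Loop Item" runs first
        collected.append(article)

print(collected)
```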



Step 7. Start running your task

  • After saving your extraction configuration, click "Next"
  • Select “Local Extraction”
  • Click “OK” to run the task on your computer.

Octoparse will automatically extract all the data selected. Check the "Data Extracted" pane for the extraction progress.



Step 8. Check the data and export

  • Check the extracted data shown in the "Data Extracted" pane.
  • Click the "Export" button to export the captured data to an Excel file, a database, or another format, and save the file to your computer.
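For comparison, exporting scraped records by hand is straightforward with Python's standard `csv` module. The rows below are illustrative, and the in-memory buffer can be swapped for a real file:

```python
import csv
import io

# Rows as they might appear in the "Data Extracted" pane (sample data).
rows = [
    {"title": "New camera line announced", "author": "Jane Doe"},
    {"title": "Access control update", "author": "John Smith"},
]

# Use open("articles.csv", "w", newline="") instead of StringIO
# to write a real file on disk.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["title", "author"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```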




Author: The Octoparse Team

Download Octoparse Today

For more information about Octoparse, please click here.

