Web Scraping Case Study | Security System News
Saturday, April 1, 2017 9:53 AM
In this tutorial, I will walk through the detailed steps to scrape news data from securitysystemsnews.com.
One key feature covered in this tutorial is pagination with a "Next" button. For the explicit steps to set up pagination, check out the related tutorial, Scraping from multi-pages: pagination with the "Next" button.
(Download the extraction task of this tutorial HERE)
Some features that we will touch upon include:
- Building a list
- Data Extraction
Now, let's get started!
Step 1. Start your task and set up basic information.
- Click “Quick Start”
- Choose "New Task (Advanced Mode)"
- Complete the “Basic Information”
- Click “Next”
Step 2. Navigate to your target webpage
- Enter the target URL in the built-in browser
- Click the "Go" icon to open the webpage
Step 3. Set up pagination
To extract data from websites with query string pagination, you need to add a page navigation action.
- Click on “Next” to the right of page numbers
- Choose “Loop Click Next Page”. This tells Octoparse to click through to each page so that the extraction actions run on every page.
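Outside Octoparse, the same "Loop Click Next Page" idea can be sketched in a few lines of Python. This is only an illustration, not how Octoparse works internally. It assumes the "Next" control is an ordinary link whose text is "Next", and it uses a hypothetical two-page site held in a dictionary so it runs without network access; FAKE_SITE, the example.com URLs, and the helper names are all made up for this sketch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Records the href of the first <a> element whose text is 'Next'."""

    def __init__(self):
        super().__init__()
        self._current_href = None  # href of the <a> we are currently inside
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")

    def handle_data(self, data):
        is_next = data.strip().lower() == "next"
        if self._current_href and self.next_href is None and is_next:
            self.next_href = self._current_href

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_href = None


def find_next_url(html, base_url):
    """Return the absolute URL behind the 'Next' link, or None on the last page."""
    finder = NextLinkFinder()
    finder.feed(html)
    return urljoin(base_url, finder.next_href) if finder.next_href else None


def walk_pages(fetch, start_url, max_pages=50):
    """Yield (url, html) per page, following 'Next' until it disappears."""
    url, seen = start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        html = fetch(url)
        yield url, html
        url = find_next_url(html, url)


# Hypothetical site: page 1 links to page 2, page 2 has no "Next" link.
FAKE_SITE = {
    "https://example.com/news?page=1": '<a href="?page=2">Next</a>',
    "https://example.com/news?page=2": "<p>last page</p>",
}


def fetch(url):
    """Stand-in for an HTTP request (assumption: swap in a real client)."""
    return FAKE_SITE[url]


visited = [url for url, _ in walk_pages(fetch, "https://example.com/news?page=1")]
```

The loop stops when no "Next" link is found, when a page repeats, or when max_pages is reached, which guards against endless pagination.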
Step 4. Create a list of items
Looking at the webpage, the articles we want are arranged in a list format, so we first need to specify a list.
- Move the cursor over an article, make sure the outlined box contains the data you want to extract, and click on it.
- When prompted, click “Create a list of items”
- Click “Add current item to the list”
The first item has now been added to the list. Next, we need to add the remaining articles to the list.
- Click “Continue to edit the list”
- Click a second article with a similar layout
- Click “Add current item to the list” again
Octoparse then adds all the remaining articles to the list.
- Click “Finish Creating List”
- Click “Loop”. This tells Octoparse to click on each item in the list to extract the selected data.
Step 5. Select the data to be extracted
Once the list has been built, we can proceed to extract the data we want.
- Right-click on the title of the first article
- Select “Extract text”
- Follow the same steps to extract the other data fields, i.e. author, sub-title, etc.
- Edit the field names if necessary.
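For readers who want to see what Steps 4 and 5 amount to in code, here is a rough Python sketch using only the standard library. It is not Octoparse's implementation. The markup, the class names (article, title, author, sub-title), and the sample text are all assumptions invented for the example; the real site's HTML will differ.

```python
from html.parser import HTMLParser

# Assumed class names; the real site's markup will differ.
ITEM_CLASS = "article"
# Maps a CSS class to the (edited) field name used in the output.
FIELD_CLASSES = {"title": "title", "author": "author", "sub-title": "sub_title"}


class ArticleListParser(HTMLParser):
    """Collects one dict per list item, with one key per extracted field."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None  # field whose text we are waiting for

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if ITEM_CLASS in classes:
            self.rows.append({})  # a new item in the list
            return
        for cls in classes:
            if self.rows and cls in FIELD_CLASSES:
                self._field = FIELD_CLASSES[cls]

    def handle_data(self, data):
        text = data.strip()
        if self._field and text:
            self.rows[-1][self._field] = text  # the "Extract text" step
            self._field = None

    def handle_endtag(self, tag):
        self._field = None


# Hypothetical page with two list items; the second has no sub-title.
SAMPLE_PAGE = """
<div class="article">
  <h2 class="title">First headline</h2>
  <p class="sub-title">First summary</p>
  <span class="author">Author One</span>
</div>
<div class="article">
  <h2 class="title">Second headline</h2>
  <span class="author">Author Two</span>
</div>
"""

parser = ArticleListParser()
parser.feed(SAMPLE_PAGE)
rows = parser.rows
```

Each entry in rows corresponds to one item in the list, and a field that is missing from an item is simply absent from that item's dictionary.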
Step 6. Adjust the relative loop sequence
After we created the list of items in Step 4, the "Loop Item" was placed outside the "Cycle Pages" action. This order is wrong, since we need to scrape all of the articles on the current page first and only then click to paginate. We therefore need to adjust the relative nesting order manually by following the steps below.
- Drag the second “Loop Item” box and position it before the “Click to paginate” action of the “Cycle Pages” loop in the Workflow Designer.
- Click “Save”
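The intended nesting can be pictured as two loops, one inside the other. The Python sketch below is only an analogy for the Workflow Designer layout, under the assumption that a site can be modeled as a list of pages, each holding a list of article titles; run_workflow and the sample data are hypothetical.

```python
def run_workflow(pages):
    """Replay the workflow order on a site modeled as a list of pages.

    Each page is a list of article titles. Returns the actions in the
    order they would run.
    """
    log = []
    page_index = 0
    while page_index < len(pages):          # outer loop: "Cycle Pages"
        for item in pages[page_index]:      # inner loop: "Loop Item"
            log.append(("extract", item))
        if page_index + 1 < len(pages):     # "Click to paginate" runs last
            log.append(("click_next", page_index + 2))
        page_index += 1
    return log


# Hypothetical site: two articles on page 1, one article on page 2.
actions = run_workflow([["a", "b"], ["c"]])
```

Because the item loop sits inside the page loop and before the pagination click, every article on a page is extracted before the next page is opened.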
Step 7. Start running your task
- After saving your extraction configuration, click “Next”
- Select “Local Extraction”
- Click “OK” to run the task on your computer.
Octoparse will automatically extract all the data selected. Check the "Data Extracted" pane for the extraction progress.
Step 8. Check the data and export
- Check the extracted data, which is shown in the “Data Extracted” pane.
- Click the "Export" button to export the captured data to an Excel file, a database, or another format, and save the file to your computer.
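If you later post-process the extracted records yourself, exporting them to a spreadsheet-friendly format takes only a few lines. The Python sketch below writes CSV with the standard library; it is not Octoparse's export feature, and the rows_to_csv helper, the field names, and the sample rows are assumptions made up for illustration.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text, with a header row."""
    buffer = io.StringIO(newline="")
    # extrasaction="ignore" drops keys that are not in fieldnames;
    # missing keys are written as empty cells.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical extracted records; the second one has no author.
sample_rows = [
    {"title": "Cameras, sensors and more", "author": "Author One"},
    {"title": "Second headline"},
]
csv_text = rows_to_csv(sample_rows, ["title", "author"])
```

The csv module quotes values that contain commas, so titles with punctuation stay in a single column when the file is opened in Excel. To save the result, write csv_text to a file opened with newline="".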
Now check out similar case studies:
- Scrape AJAX Pages from The Washington Post
- Scrape Article Information from Google Scholar
- How to Scrape Wordpress Posts
Or learn more about pagination:
- Pagination Loop issue: The extraction stops after 3 pages
- Create A Loop For Pagination Manually
- Pagination Scraping: Configure “Loop click next page” When It Can’t Be Detected
- Scraping from multi-pages: pagination with the "Next" button
Author: The Octoparse Team
For more information about Octoparse, please click here.