Making a Simple Web Scraper with Octoparse
Thursday, March 09, 2017
Not all of the data on a web page is visible to the reader. Some of it sits in the underlying HTML document, and this hidden data is often the more valuable part. To make good use of content posted on the web, we first have to extract it, legally and for legitimate purposes. This process is called web scraping, and the tool that performs the extraction is called a web scraper. Without programming skills, we usually copy and paste web content by hand, which is time-consuming and inefficient. Besides, information on a website is stored in different forms, either inside an HTML tag or in an HTML attribute. It is therefore better for non-programmers to use web scraping software that can grab exactly the content they want from a website and feed it into their own system or database.
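For readers who do want a peek at the code side, here is a minimal Python sketch of the difference between content inside an HTML tag and content inside an HTML attribute. It uses only the standard library. The HTML snippet, the `data-sku` attribute and the class name `TextAndAttributeParser` are made up for illustration and do not come from any real site.

```python
from html.parser import HTMLParser

# Hypothetical snippet: "$25" and "Read more" are visible on the page,
# while the SKU and the link target exist only as attribute values.
SAMPLE_HTML = (
    '<p>Price: <span data-sku="A-100">$25</span> '
    '<a href="/items/a-100">Read more</a></p>'
)


class TextAndAttributeParser(HTMLParser):
    """Collects visible text and attribute values separately."""

    def __init__(self):
        super().__init__()
        self.text_parts = []   # what a reader sees on the page
        self.attributes = []   # (tag, attribute name, value) triples

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            self.attributes.append((tag, name, value))

    def handle_data(self, data):
        # Skip whitespace-only runs between tags.
        if data.strip():
            self.text_parts.append(data.strip())


parser = TextAndAttributeParser()
parser.feed(SAMPLE_HTML)
parser.close()

print("Visible text:", parser.text_parts)
print("Attribute data:", parser.attributes)
```

A scraping tool does roughly this kind of separation for you, so that you can pick either the visible text or the attribute value of an element.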
(picture from neerajkumar.name)
If you are reading this article, I assume you have been extracting data from websites by hand and are thinking of making a simple web scraper. In fact, it’s easy to make one with automated web data extraction software, and you don’t even need to know how to write code. All you need is the right tool. So, with so many web data extraction tools to choose from, how do you pick the best one?
What is the first thing that comes to mind? Well, ideally the software is free. So we will use Octoparse, a powerful automated data extraction tool with a rich set of advanced features that helps you extract all the text in HTML documents. Click HERE to learn more about Octoparse.
Honestly, it is easier to understand how a web scraper works if you know the structure of a web page. Let’s make a simple web scraper using Octoparse that extracts the titles and URLs of all the case tutorials from octoparse.com.
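To make the idea of page structure concrete, the sketch below shows what a list page often looks like in HTML and how the title and URL of each entry could be read from it in Python. This is only an illustration and not how Octoparse works internally. The markup, the example.com address and the class name `LinkListParser` are assumptions, and the real octoparse.com pages may be structured differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical list page: every entry is an <li> wrapping one <a> element.
BASE_URL = "https://example.com/tutorials/"
LIST_HTML = """
<ul class="tutorials">
  <li><a href="case-one">Case One</a></li>
  <li><a href="case-two">Case Two</a></li>
</ul>
"""


class LinkListParser(HTMLParser):
    """Records a (title, absolute URL) pair for every <a> element."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.current_href = None  # set only while inside an <a> element
        self.current_text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.current_href = dict(attrs).get("href")
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            title = "".join(self.current_text).strip()
            url = urljoin(self.base_url, self.current_href)
            self.links.append((title, url))
            self.current_href = None


parser = LinkListParser(BASE_URL)
parser.feed(LIST_HTML)
parser.close()

for title, url in parser.links:
    print(title, url)
```

Because every entry follows the same pattern, pointing at a couple of entries is enough for a tool to work out the rest of the list.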
Step 1. Download Octoparse and launch it. Choose the Wizard Mode and click on the “Start” button.
Step 2. Click on the “Create” button under “List and Detail Extraction”, then enter the basic information for the web scraper.
Step 3. Enter the URL from which we want to pull data.
Step 4. Click any two items on the web page, then click on the “Next” button.
Step 5. Check the “Enable pagination” option, scroll to the bottom of the web page and click the “Next Page” link four times, then click on the “Next” button. Octoparse will take you to the tutorial detail page.
Step 6. Click the content you want from the tutorial, and click on the “Next” button.
Step 7. Now you are done making a simple web scraper! Click “Local Extraction” to begin extracting data from octoparse.com.
The extraction results appear in the Data Extracted pane. You can export the data if needed.
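If you are curious what pagination and export amount to in code, the following Python sketch follows "next page" links and writes the collected rows as CSV. It is a simplified illustration under stated assumptions, not a description of Octoparse itself: the `PAGES` dictionary stands in for a real website so that the example runs without a network connection, and the names `collect_rows` and `to_csv` and the `max_pages` limit are hypothetical.

```python
import csv
import io

# Hypothetical stand-in for a paginated site: each "page" holds the rows
# found on it and the key of the next page (None on the last page).
PAGES = {
    "page-1": {
        "rows": [("Case One", "https://example.com/case-one")],
        "next": "page-2",
    },
    "page-2": {
        "rows": [("Case Two", "https://example.com/case-two")],
        "next": None,
    },
}


def collect_rows(start, max_pages=10):
    """Follow "next" links from the start page, gathering rows on the way."""
    rows = []
    current = start
    visited = 0
    # Stop at the last page, or at max_pages as a guard against endless loops.
    while current is not None and visited < max_pages:
        page = PAGES[current]
        rows.extend(page["rows"])
        current = page["next"]
        visited += 1
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["title", "url"])
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(collect_rows("page-1"))
print(csv_text)
```

The CSV text can be saved to a file and opened in a spreadsheet, which is similar to what you get from exporting the extracted data.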
In this tutorial we’ve made a simple web scraper with Octoparse in a few minutes. Since most of the data that can bring valuable insight sits on complex websites, you can explore Octoparse further and try to make a web scraper that collects semi-structured data and converts it into structured data, making it much more usable for further processing. Happy scraping!
Author: The Octoparse Team