Making a Simple Web Scraper with Octoparse
Tuesday, February 18, 2020
To make good use of content published on the web, we can extract it for legitimate purposes, within the law. This process is called web scraping, and the tool that performs the extraction is called a web scraper.
If we don’t know how to program, we usually copy and paste web content by hand. This manual approach is time-consuming and inefficient. Besides, the information on a web page is stored in different places in the HTML: sometimes in the text inside a tag, sometimes in the value of an attribute. For non-programmers, it is therefore better to use web scraping software that can grab exactly the content you want from a website and feed it into your own system or database.
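For readers who are curious what "within an HTML tag or an HTML attribute" means in practice, here is a minimal sketch in Python using only the standard library. It is not how Octoparse works internally; the sample HTML, the LinkCollector class, and the extract_links helper are made up for illustration. The title of each link sits in the text between the tags, while the URL sits in the href attribute.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (text, href) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # The URL is stored in an attribute of the tag.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            # The title is stored in the text between the tags.
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


# Hypothetical snippet, invented for this example.
SAMPLE_HTML = """
<ul>
  <li><a href="/tutorial/first-case">First case tutorial</a></li>
  <li><a href="/tutorial/second-case">Second case tutorial</a></li>
</ul>
"""


def extract_links(html):
    """Return a list of (title, url) tuples found in the given HTML."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    for title, url in extract_links(SAMPLE_HTML):
        print(title, "->", url)
```

A point-and-click tool does this kind of lookup for you, which is why no code is needed in the steps that follow.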
I assume you are currently extracting data from websites manually and are thinking about making a simple web scraper instead. In fact, automated web data extraction software makes this easy, and you don’t even need to know how to write code. All you need is the right tool. So, with so many web data extraction programs available, how do you choose the best one?
What is the first thing that comes to mind? Ideally, the software should be free. Octoparse is a great option here: it is powerful automated data extraction software with advanced features that help you extract the text from HTML documents. Click HERE to learn more about Octoparse.
A web scraper is easier to understand if you know the structure of a web page. Let’s make a simple web scraper with an older version of Octoparse that extracts the titles and URLs of all the case tutorials on octoparse.com.
Check out the latest version of this article with Octoparse 7.X: How to Build a Web Crawler – A Guide for Beginners
Step 1. Download Octoparse and launch it. Choose the Wizard Mode and click on the “Start” button.
Step 2. Click on the “Create” button under “List and Detail Extraction”, then enter the basic information for the web scraper.
Step 3. Enter the URL from which we want to pull data.
Step 4. Click any two items on the web page, then click on the “Next” button.
Step 5. Check the “Enable pagination” option, scroll to the bottom of the web page, and click the “Next Page” link 4 times, then click on the “Next” button. Octoparse will take you to the tutorial detail page.
Step 6. Click the content you want from the tutorial, and click on the “Next” button.
Step 7. Now you are done making a simple web scraper! Click “Local Extraction” to begin extracting data from octoparse.com.
The extracted data appears in the Data Extracted pane. You can export the data if needed.
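If you wonder what a list-and-detail task with pagination does behind the scenes, the rough idea can be sketched in Python. This is only an illustration under stated assumptions, not Octoparse’s implementation: the tiny site in PAGES, the "item" and "next" class names, and the fetch, parse, scrape, and to_csv helpers are all hypothetical, and the pages are held in memory so the sketch runs without a network connection.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical site held in memory so the sketch runs offline.
PAGES = {
    "/list?page=1": '<a class="item" href="/t/1">A</a>'
                    '<a class="item" href="/t/2">B</a>'
                    '<a class="next" href="/list?page=2">Next Page</a>',
    "/list?page=2": '<a class="item" href="/t/3">C</a>',
    "/t/1": "<h1>Tutorial one</h1>",
    "/t/2": "<h1>Tutorial two</h1>",
    "/t/3": "<h1>Tutorial three</h1>",
}


def fetch(url):
    """Stand-in for an HTTP request; a real scraper would download the page."""
    return PAGES[url]


class PageParser(HTMLParser):
    """Record item links, the next-page link, and the <h1> text of a page."""

    def __init__(self):
        super().__init__()
        self.item_urls = []
        self.next_url = None
        self.title = ""
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "item":
            self.item_urls.append(attrs.get("href"))
        elif tag == "a" and attrs.get("class") == "next":
            self.next_url = attrs.get("href")
        elif tag == "h1":
            self._in_h1 = True

    def handle_data(self, data):
        if self._in_h1:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False


def parse(url):
    """Fetch one page and return the parser holding what was found on it."""
    parser = PageParser()
    parser.feed(fetch(url))
    parser.close()
    return parser


def scrape(start_url, max_pages=5):
    """Walk the list pages, open each detail page, and return title/URL rows."""
    rows, url, pages_seen = [], start_url, 0
    while url is not None and pages_seen < max_pages:
        listing = parse(url)
        for item_url in listing.item_urls:
            rows.append({"title": parse(item_url).title.strip(), "url": item_url})
        url = listing.next_url  # None on the last page, which ends the loop
        pages_seen += 1
    return rows


def to_csv(rows):
    """Export the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(scrape("/list?page=1")))
```

The max_pages limit plays a role similar to deciding how many times to follow the “Next Page” link, and to_csv corresponds to exporting the extracted data.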
In this tutorial, we’ve made a simple web scraper with Octoparse in a few minutes. Since much of the data that can bring valuable insight sits on complex websites, you can explore Octoparse further and build a web scraper that collects semi-structured data and converts it into structured data for further processing. Happy scraping!
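As a small illustration of turning semi-structured data into structured data, here is a sketch in Python. The raw strings, the "Key: value | Key: value" layout, and the to_structured helper are assumptions made up for this example; real scraped text will need its own rules.

```python
from datetime import date

# Hypothetical raw strings, as a scraper might return them from a page.
RAW_RECORDS = [
    "Title: Scrape product listings | Published: 2020-02-18 | Views: 1,204",
    "Title: Scrape job postings | Views: 87",
]


def to_structured(raw):
    """Turn one 'Key: value | Key: value' string into a typed dictionary."""
    fields = {}
    for part in raw.split("|"):
        # Split on the first colon only, so values may contain colons.
        key, _, value = part.partition(":")
        fields[key.strip().lower()] = value.strip()

    published = fields.get("published")
    return {
        "title": fields.get("title", ""),
        # A missing date becomes None instead of raising an error.
        "published": date.fromisoformat(published) if published else None,
        # Remove thousands separators before converting to a number.
        "views": int(fields.get("views", "0").replace(",", "")),
    }


if __name__ == "__main__":
    for record in RAW_RECORDS:
        print(to_structured(record))
```

Once every record has the same fields and types, it can be loaded into a spreadsheet or database for further processing.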
Author: The Octoparse Team