Step-by-step tutorials for you to get started with web scraping


Lesson 3: Getting data - Capture text from a page

Thursday, August 16, 2018

Now that you’ve downloaded Octoparse on your PC and learned about the user interface, you are ready to start your own web scraping project.
 
Extracting text is the basic skill to acquire, since most data on the web is presented as visible text, such as news articles, product information, and blog posts. In this lesson, I will show how to capture simple text data from a webpage with simple point-and-click actions. Basic text extraction, when combined with other techniques such as pagination and list building, lays the foundation for scraping data from all kinds of webpages.

So let's start with capturing text from a single web page. 

1) Start a new task and enter the URL of the target web page [Download the task file in this lesson]

Once you're logged in, click the "+ Task" button under Advanced Mode to create a new task. Then enter one or more URLs.

 

Tips!

1. What is a task?

A task is a crawler that scrapes data, usually from one website, with an unlimited number of page/URL requests. Each crawler in Octoparse is defined by the scraping task you configure: the task tells Octoparse which website to open, what data to extract, and so on.

2. Why should I use Advanced Mode?

Advanced Mode is a powerful mode that offers the flexibility to scrape many different kinds of websites. It lets you customize each individual action needed to perform the extraction, including keyword searches, login authentication, opening dropdowns, etc.
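To make the idea of a task more concrete, here is a minimal sketch of a task viewed as a small configuration record: which pages to open and which pieces of data to collect. This is only an illustration and not Octoparse's internal format; the class name `ScrapeTask` and its attributes are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ScrapeTask:
    """Hypothetical stand-in for a scraping task (not an Octoparse API)."""
    name: str
    urls: list[str]                                  # one or more pages to open
    fields: list[str] = field(default_factory=list)  # data items to capture

    def describe(self) -> str:
        # Summarize what the task would open and collect.
        return f"{self.name}: {len(self.urls)} URL(s), fields = {', '.join(self.fields)}"


task = ScrapeTask(
    name="Blog post text",
    urls=["https://www.octoparse.com/blog/top-5-web-scraping-tools-comparison/"],
    fields=["title", "date", "content"],
)
print(task.describe())
```

In Octoparse you build the same kind of description by pointing and clicking instead of writing it out by hand.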

 
Here we take one of our blog posts as an example. Suppose our goal is to extract the blog information from the page.

Copy and paste the URL into the "Extraction URL" textbox. Then click "Save URL" and Octoparse will open the web page in the built-in browser.

URL: https://www.octoparse.com/blog/top-5-web-scraping-tools-comparison/

 

2) Click on target data to capture

Now, start capturing the data you need by clicking directly on the pieces of information you want.

When the data is selected properly, the selection will be highlighted in green.

 

Click on the title of the post, the posted date, or the content to capture it.

 

Notice that the data you clicked on now shows in the Action Panel. You can edit the field names now or leave them until later. Select "Extract Data" to complete the text extraction action.
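For readers curious about what clicking an element and choosing "Extract Data" amounts to, the sketch below does a rough equivalent in Python using only the standard library: it finds the elements for the title, date, and content and reads their text into named fields. This is not how Octoparse works internally. The markup in `SAMPLE_HTML`, the class names `title`, `date`, and `content`, and the helpers `FieldExtractor` and `extract_fields` are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Hypothetical, simplified markup standing in for a blog post page.
SAMPLE_HTML = """
<article>
  <h1 class="title">Sample post title</h1>
  <span class="date">Sample date</span>
  <div class="content"><p>First paragraph.</p><p>Second paragraph.</p></div>
</article>
"""


class FieldExtractor(HTMLParser):
    """Collects the text inside elements whose class matches a target field."""

    def __init__(self, targets):
        super().__init__()
        self.targets = set(targets)
        self.fields = {name: [] for name in targets}
        self._current = None  # field currently being captured, if any
        self._depth = 0       # nesting depth inside the captured element

    def handle_starttag(self, tag, attrs):
        if self._current is not None:
            # Nested tag inside a captured element (assumes no void tags like <br>).
            self._depth += 1
            return
        css_class = dict(attrs).get("class")
        if css_class in self.targets:
            self._current = css_class
            self._depth = 1

    def handle_endtag(self, tag):
        if self._current is not None:
            self._depth -= 1
            if self._depth == 0:
                self._current = None  # left the captured element

    def handle_data(self, data):
        if self._current is not None and data.strip():
            self.fields[self._current].append(data.strip())


def extract_fields(html, targets):
    """Return one row of data: field name -> captured text."""
    parser = FieldExtractor(targets)
    parser.feed(html)
    parser.close()
    return {name: " ".join(parts) for name, parts in parser.fields.items()}


row = extract_fields(SAMPLE_HTML, ["title", "date", "content"])
print(row)
```

Each key in `row` plays the role of a field name in the Action Panel, and each value is the text captured for that field.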

 

Tips!

1. Turn on the “Workflow” button to preview the workflow you have designed.

2. In Octoparse 7.X, the task name is automatically generated at the top of the configuration interface. To change it, just click the textbox and type in the desired name. Don’t forget to save your changes.

 

3) Save and run the task to capture data

Click "Save and run" in the Action Panel or, alternatively, click "Start Extraction" to run the completed task.

 

Here is the data we got from running the task.  

 

 

 

 Lesson 4: Capture a list of items

 
