Select and extract data/URL/image/HTML

Thursday, August 16, 2018

In this tutorial, we will show you how to use Octoparse to extract text, link URLs, image URLs, and HTML.

 

But before we start, let's take a quick look at how Octoparse scrapes the data you need.

When building a new task, you usually begin by selecting the data on the web page that you want Octoparse to scrape. To select elements on the page, you create a selection. Generally, this takes two steps:

1. Click on your target data

2. Select the appropriate action from "Action Tips", such as "Select all" or "Extract text of the selected element"

When you click the element you need, it is outlined in a green box. Other elements on the page may be highlighted in red boxes at the same time. This is because Octoparse works out the pattern that identifies the selected element on the web page and automatically highlights the other elements that match this pattern, since you may want to capture them all.

Once the selection is created, all similar elements across multiple pages will be detected and added to the selection based on the pattern. Octoparse then repeats the extraction until every element in the selection has been extracted.

Now that you know Octoparse a little better, let's see how to select and extract three specific types of data with Octoparse!
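
To make the idea of a "pattern" more concrete, here is a minimal Python sketch. It is not Octoparse's actual detection logic, which this tutorial does not describe. It simply assumes that two elements are "similar" when they share the same tag name and class attribute. The sample page and the names `SignatureCounter` and `count_similar` are made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class SignatureCounter(HTMLParser):
    """Count how often each (tag, class) combination appears in a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Assumption: tag name plus class attribute is the "pattern".
        css_class = dict(attrs).get("class") or ""
        self.counts[(tag, css_class)] += 1


def count_similar(html, tag, css_class):
    """Return how many elements share the clicked element's signature."""
    counter = SignatureCounter()
    counter.feed(html)
    counter.close()
    return counter.counts[(tag, css_class)]


# Hypothetical page with three product entries and one unrelated paragraph.
PAGE = """
<ul>
  <li class="product"><span class="name">Kettle</span></li>
  <li class="product"><span class="name">Toaster</span></li>
  <li class="product"><span class="name">Blender</span></li>
</ul>
<p class="footer">Contact</p>
"""

# "Clicking" one product name finds all three elements with that signature.
print(count_similar(PAGE, "span", "name"))
```

In this toy page, selecting a single product name is enough to identify all three names, which is roughly what happens when the other matching elements light up in red.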

 

1) Extract Text

2) Extract the URL of a link or an image

3) Extract inner/outer HTML

1) Extract Text

Most data on the web is presented as human-readable text, such as news articles, product information, and blog posts. Once you know how to extract text, you can combine it with other techniques such as pagination and list building to scrape data from almost any kind of web page.

Let's see how to select and extract the text data with Octoparse.

 

1. Click on the target data you want

When you click the element you need, it is outlined in a green box. Similar elements on the web page are highlighted in red.

2. Create the selection

Click "Select all". The similar elements that were highlighted in red now turn green, and you can see in "Action Tips" that the selection has been created. Octoparse will then repeat the extraction until the text of every element in the selection has been extracted.

3. Extract text

Click "Extract text of the selected elements" to finish creating the selection.
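
For readers who want to see what "extract text of the selected elements" amounts to outside the tool, the following Python sketch does something comparable with the standard library. It is an illustration under assumptions, not how Octoparse is implemented: the selection is assumed to be "every element with a given tag and class", and the sample HTML and the names `TextExtractor` and `extract_text` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Gather the visible text of every element with a given tag and class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0      # nesting depth inside a matching element
        self.texts = []     # one entry per matching element

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == self.tag:
                self.depth += 1  # same tag nested inside a match
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.depth = 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


def extract_text(html, tag, css_class):
    """Return the cleaned text of each matching element, in page order."""
    parser = TextExtractor(tag, css_class)
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so each result is a clean single line.
    return [" ".join(text.split()) for text in parser.texts]


# Hypothetical news page: two headlines and one byline.
SAMPLE = """
<div class="headline">Markets <b>rally</b> on Friday</div>
<div class="byline">Staff writer</div>
<div class="headline">Rain expected
    this weekend</div>
"""

print(extract_text(SAMPLE, "div", "headline"))
```

Text inside nested tags (the `<b>` element above) is kept as part of the surrounding element's text, and elements outside the selection (the byline) are ignored.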

 

 

 

2) Extract the URL of a link or an image

Colloquially, a URL is a hyperlink. With a single click on a link, you can open a new web page or go to a new website, just as when you click the title of a book on Amazon.

Besides web pages, a URL also gives you access to a specific file on the Internet, such as an image. Once you have the URL, you can download the corresponding file or image.

Let's see how to select and extract the URL of a link or an image with Octoparse.

 

1. Click on the link/image you want

When you click the link/image you need, it is outlined in a green box. Similar elements on the web page are highlighted in red.

 

Tips!

When you select an item with a URL, the selected tag shown at the bottom of “Action Tips” should be “A”. “A” stands for anchor, the HTML element that usually links one page to another. To create a correct pattern that scrapes all the elements, make sure you select the right area.

 

2. Create the selection

Click "Select all". The similar elements that were highlighted in red now turn green, and you can see in "Action Tips" that the selection has been created. Octoparse will then repeat the extraction until the URL of every element in the selection has been extracted.

3. Extract the URL

Click "Extract the URLs of the selected elements" (for links) or "Extract image URL in the loop" (for images) to finish creating the selection.
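
As a rough picture of what these two actions return, the sketch below collects link URLs from `A` tags and image URLs from `IMG` tags using Python's standard library. This is an illustration only and makes no claim about Octoparse's internals. The base URL, the sample HTML, and the names `UrlExtractor` and `extract_urls` are assumptions introduced for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class UrlExtractor(HTMLParser):
    """Collect link URLs (<a href>) and image URLs (<img src>) from a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.link_urls = []
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            # Resolve relative links against the page address.
            self.link_urls.append(urljoin(self.base_url, attributes["href"]))
        elif tag == "img" and attributes.get("src"):
            self.image_urls.append(urljoin(self.base_url, attributes["src"]))


def extract_urls(html, base_url):
    """Return (link_urls, image_urls) as absolute URLs."""
    parser = UrlExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.link_urls, parser.image_urls


# Hypothetical shop page containing one image link and one text link.
SAMPLE = """
<a href="/books/1"><img src="covers/1.jpg" alt="Book one"></a>
<a href="https://example.org/about">About</a>
"""

links, images = extract_urls(SAMPLE, "https://example.com/shop/")
print(links)
print(images)
```

Note that the link lives on the `A` tag and the image address on the `IMG` tag inside it, which is why selecting the right area matters when you want one rather than the other.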

 

Tips!

Can I just use Octoparse to directly get an image, not its URL, from the web page?

Unfortunately, you can’t use Octoparse to extract the image itself. If you want to extract images, you can scrape the URLs of the images with Octoparse first, and then bulk download the images with a “download from URL” tool.
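
If you do not have a dedicated download tool at hand, a few lines of Python can play that role. The sketch below is one possible approach, not an Octoparse feature: it assumes you have exported the image URLs as a plain list, and the function names `filename_for` and `download_images` are made up for this example.

```python
import os
import urllib.request
from urllib.parse import urlparse


def filename_for(url, index):
    """Pick a local file name from the URL path, or fall back to a numbered name."""
    name = os.path.basename(urlparse(url).path)
    return name or f"image_{index}"


def download_images(urls, folder):
    """Download each URL into `folder` and return the saved file paths."""
    os.makedirs(folder, exist_ok=True)
    saved = []
    for index, url in enumerate(urls, start=1):
        target = os.path.join(folder, filename_for(url, index))
        urllib.request.urlretrieve(url, target)  # performs the network request
        saved.append(target)
    return saved


# File names are derived from the URL path; query strings are ignored.
print(filename_for("https://example.com/covers/1.jpg?size=large", 1))
print(filename_for("https://example.com/", 2))
```

To use it, pass the list of scraped image URLs and a destination folder to `download_images`. Two URLs that end in the same file name would overwrite each other in this simple version, so a real script may want to add the index to every name.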

  

 

 

 

3) Extract inner/outer HTML

Unlike text and URLs, data such as icons cannot be extracted directly. When you want to extract visual, non-text content such as a star rating, you have to extract the inner/outer HTML of that content.

Besides icons, you can also scrape hidden text, charts, and graphs from a web page by first extracting the HTML of these elements.

To get the data behind the icons, you then need to apply regular expressions to clean up the extracted HTML.

First let's see how to select and extract inner/outer HTML with Octoparse. 

 

1. Click on the target data you want

When you click the element you need, it is outlined in a green box. Similar elements on the web page are highlighted in red.

2. Extract inner/outer HTML

Click "Extract inner/outer HTML of the selected" in "Action Tips" to finish creating the selection.
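
If the difference between inner and outer HTML is unclear, this short Python sketch may help: outer HTML includes the element's own opening and closing tags, while inner HTML is only what sits between them. The snippet and the function `inner_and_outer_html` are hypothetical, and the regular expression is a deliberate simplification that assumes the element does not contain another element with the same tag name.

```python
import re


def inner_and_outer_html(html, tag, css_class):
    """Return (inner_html, outer_html) of the first matching element, or None.

    Simplification: assumes no element of the same tag is nested inside.
    """
    pattern = re.compile(
        rf'<{tag}\b[^>]*\bclass="[^"]*\b{re.escape(css_class)}\b[^"]*"[^>]*>'
        rf'(.*?)</{tag}>',
        re.DOTALL,
    )
    match = pattern.search(html)
    if match is None:
        return None
    return match.group(1), match.group(0)


# Hypothetical product entry whose rating is shown as an icon.
SNIPPET = (
    '<div class="item"><span class="rating">'
    '<i class="star star-4"></i> 128 reviews</span></div>'
)

inner, outer = inner_and_outer_html(SNIPPET, "span", "rating")
print("inner:", inner)
print("outer:", outer)
```

Here the star icon carries no visible text at all, so extracting text alone would lose the rating. Extracting the HTML keeps the `star-4` class name, which can be cleaned up afterwards.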

 

Tips!

Octoparse provides built-in features and tools for applying regular expressions.
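
As an illustration of the kind of cleanup a regular expression can do, the Python sketch below pulls a star rating and a review count out of a piece of extracted HTML. The class name `star-4` and the "128 reviews" wording are assumptions about a hypothetical page; real sites encode ratings in many different ways, so the patterns would need adjusting. The same expressions could be adapted for Octoparse's own regular expression tools.

```python
import re


def star_rating(html):
    """Return the rating encoded in a class name such as "star-4", or None."""
    match = re.search(r'class="[^"]*\bstar-(\d)\b', html)
    return int(match.group(1)) if match else None


def review_count(html):
    """Return the number in front of the word "reviews", or None."""
    match = re.search(r'([\d,]+)\s+reviews', html)
    return int(match.group(1).replace(",", "")) if match else None


# Hypothetical inner HTML of a rating element.
EXTRACTED = '<i class="star star-4"></i> 1,280 reviews'

print(star_rating(EXTRACTED))
print(review_count(EXTRACTED))
```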

Related articles

Extract Text from HTML - Using RegExp Tool 

Use Regular Expression to Reformat Captured Data 

Re-format data extracted 

 

 

Related articles:

Use lists to extract 

Extract multiple pages through pagination 

Extract behind a login 

Extract from source code 

Extract page-level data 

Extract from a list of URLs 

 

 
