In this tutorial, we will show you how to extract text, URL, image URL, HTML, and other attribute values.

1. Extract Text

Click on your target data, then select Text from the Tips panel.
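Octoparse does this for you without any code. If you are curious what "extracting text" means at the HTML level, here is a minimal Python sketch using the standard library's html.parser. It is only an illustration, not how Octoparse works internally; the sample markup and the names TextExtractor and extract_text are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the visible text of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text found between tags.
        if data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical sample markup, not taken from any real page.
sample = '<h2 class="title"><a href="/book/1">A Sample <b>Book</b> Title</a></h2>'
print(extract_text(sample))
```

The tags and their attributes are dropped, and only the words a visitor would read on the page are kept.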

2. Extract the URL (of a link or an image)

A URL is the web address that a hyperlink points to. With a single click on a link, you can open a new web page or go to a new website, just like what happens when you click on the title of a book on Amazon.

Besides a web page, a URL can also point to a specific file on the Internet, such as an image or a PDF document. Once you have the URL, you can use it to download the corresponding file or image.
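To illustrate the idea of downloading a file from its URL, here is a small Python sketch using only the standard library. The function name download, the fallback file name, and the example address in the comment are assumptions made for this illustration; the destination folder is assumed to exist already.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def download(url, dest_dir="."):
    """Save the resource at `url` into `dest_dir` and return the local path."""
    # Use the last path segment of the URL as the file name.
    name = Path(urlparse(url).path).name or "download"
    target = Path(dest_dir) / name
    # Stream the response to disk instead of loading it all into memory.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


# Example call (hypothetical URL, so it is left commented out):
# download("https://example.com/files/report.pdf", "downloads")
```

Some websites refuse requests that do not look like they come from a browser, so a real download may need extra request headers.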

2.1 Extract the URL of a link

Click on your target data, then select Link from the Tips panel.

TIP: When you select an item with a URL, the selected tag at the bottom of "Tips" should be "A", which stands for anchor, the HTML element that usually links one page to another. Please make sure you select the right area.
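The reason the "A" tag matters is that the URL is stored in the href attribute of the anchor element, not in the text or in the elements nested inside it. The Python sketch below shows this on a made-up snippet; LinkExtractor, extract_links, and the example.com address are hypothetical names chosen for the illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only the anchor tag carries the link; a <span> inside it does not.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Turn relative links such as "/book/1" into full URLs.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample markup and page address.
sample = '<div class="item"><a href="/book/1"><span>A Sample Book Title</span></a></div>'
links = extract_links(sample, "https://example.com/books/")
print(links)
```

If only the inner span were selected, there would be no href to read, which is why selecting the right area is important.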

2.2 Extract the image URL

Click on your target data, then select Image URL from the Tips panel.

FAQ: Can I use Octoparse to directly get an image, not its URL, from the web page?

Yes! With the scrape and download feature introduced in version 8.5.4, you can download the image directly while scraping.

3. Extract the inner/outer HTML

Unlike text and URLs, data such as icons cannot be extracted directly. If you want to extract visual, non-text content, like a star rating, you have to extract the inner/outer HTML of that content.

Besides icons, you can also scrape hidden text, charts, and graphs from a web page by first extracting the HTML of these elements. After getting the HTML code, you need to apply regular expressions to clean up the data.
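As a rough idea of what such a clean-up looks like, here is a Python sketch that pulls a star rating out of a piece of HTML with regular expressions. The HTML snippet and the patterns are assumptions for this example; the markup on a real page will differ, and the pattern has to be adapted to it.

```python
import re

# Hypothetical inner HTML of a star-rating widget; real pages will differ.
inner_html = (
    '<i class="icon-star star-4-5"></i>'
    '<span class="sr-only">4.5 out of 5 stars</span>'
)

# Capture the number (with optional decimals) that precedes "out of 5".
match = re.search(r"(\d+(?:\.\d+)?)\s+out of 5", inner_html)
rating = float(match.group(1)) if match else None

# Remove every tag to keep only the readable text.
plain_text = re.sub(r"<[^>]+>", "", inner_html).strip()

print(rating)
print(plain_text)
```

The first pattern keeps only the number, while the second strips the tags and leaves the sentence as a visitor would read it.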

To extract inner/outer HTML, click on your target data, then select Inner/Outer HTML from the Tips panel.
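The difference between the two options is that outer HTML includes the selected element's own tag, while inner HTML contains only what sits inside it. The Python sketch below shows this on a made-up fragment. It uses xml.etree.ElementTree from the standard library, which assumes the fragment is well-formed; real-world HTML often is not, so treat this purely as an illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for a selected element.
fragment = '<div class="rating"><span title="4.5 out of 5 stars"></span></div>'
element = ET.fromstring(fragment)

# Outer HTML: the element itself plus everything inside it.
outer_html = ET.tostring(element, encoding="unicode", method="html")

# Inner HTML: the element's leading text plus its serialized children,
# without the element's own opening and closing tags.
inner_html = (element.text or "") + "".join(
    ET.tostring(child, encoding="unicode", method="html") for child in element
)

print(outer_html)
print(inner_html)
```

Here the class attribute of the div only appears in the outer HTML, so choose Outer HTML when the information you need is stored on the selected element itself.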

TIP: To refine the extracted inner/outer HTML into useful data, you might want to check out these tutorials -

4. Extract attribute value

Attributes are part of the HTML code and provide additional information about HTML elements. They usually come in name/value pairs like name="value". For example, the star rating is often stored in an attribute. Octoparse can scrape the value directly.

Click on the target element (here we take the star rating as an example) and select OuterHtml.

Go to the Data Preview section, hover over the name field, click on the ... more button, select Customize field, then choose your target attribute under Extract attribute.
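To make the name/value idea concrete, the Python sketch below reads one attribute value out of a piece of outer HTML. It is an illustration only, not a description of what Octoparse does internally; the markup, the aria-label attribute, and the names AttributeExtractor and extract_attribute are assumptions chosen for the example.

```python
from html.parser import HTMLParser


class AttributeExtractor(HTMLParser):
    """Collects the value of one attribute from every tag that carries it."""

    def __init__(self, attribute):
        super().__init__()
        self.attribute = attribute
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, e.g. [("class", "rating")].
        for name, value in attrs:
            if name == self.attribute and value is not None:
                self.values.append(value)


def extract_attribute(html, attribute):
    parser = AttributeExtractor(attribute)
    parser.feed(html)
    parser.close()
    return parser.values


# Hypothetical outer HTML of a star-rating element.
outer_html = (
    '<span class="rating" aria-label="4.5 out of 5 stars">'
    '<i class="icon-star"></i></span>'
)
print(extract_attribute(outer_html, "aria-label"))
```

The rating never appears as visible text in this snippet; it exists only as the value of an attribute, which is why the attribute has to be selected explicitly.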

Refine extracted data (replace content, add a prefix, etc.)

Select the correct HTML tag for web elements

Extract data

What types of websites/data can Octoparse scrape?