Extract Text from HTML - Using RegExp Tool

Thursday, September 29, 2016 6:22 AM

(Download my extraction task of this tutorial HERE just in case you need it.)

You may wonder whether you can extract text from an HTML document, since it usually contains tags that you don’t need. Or you may want to scrape text that is hidden on the page but can be seen in the HTML source. Both are possible.

In this tutorial, I will use octoparse.com as an example to show you how to extract text from HTML effectively.

Step 1. Set up basic information

Choose “Advanced Mode”. ➜ Click “Start” ➜ Complete the basic information. ➜ Click “Next” ➜ Enter the target URL in the built-in browser. ➜ Click the “Go” icon to open the webpage.

(URL of the example: http://www.octoparse.com/ )

 

Step 2. Extract the text you want from the HTML.

For example, suppose you want to extract text that stays hidden on the page until you move your cursor over its section (see the example below).

 

 

There are two ways to extract the text in this case. The first is to extract the text directly.

Choose the section you want. ➜ Click the highlighted link. ➜ Select “Extract text”. All the text in the section will be extracted.
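If you want to see the same idea outside Octoparse, here is a minimal Python sketch of "keep the text, drop the tags". It uses only the standard library. The names `TextCollector` and `extract_text` and the sample fragment are made up for illustration and are not taken from octoparse.com or from how Octoparse works internally.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text found between tags.
        stripped = data.strip()
        if stripped:
            self.parts.append(stripped)


def extract_text(html):
    """Return the visible text of `html`, with pieces joined by spaces."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.parts)


# Hypothetical fragment, invented for this example.
sample = '<div class="box"><h3>Cloud Service</h3><p style="">Run tasks 24/7.</p></div>'
print(extract_text(sample))
```

Note that this returns every piece of text in the fragment, headings included, which is why a more selective method can be useful.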

 

The second is to use our RegExp Tool.

Choose the section you want. ➜ Click the highlighted link. ➜ Click “Extract Inner HTML, including the page source code, text with format and images”.

Then you need to customize your data. ➜ Click “Customize Field” ➜ Click “Re-format extracted data” ➜ Click “Add step” ➜ Click “Match with Regular Expression”.

 

Click “Try RegEx Tool”. In the “Source Text”, you will find that the text begins with <p style=""> and ends with </p>. ➜ Click “Start With” and paste <p style=""> ➜ Click “End With” and paste </p> ➜ Click “Generate” ➜ Click “Match All” ➜ Click “Match” ➜ Click “Apply”.

The text will be shown in the “Output” browser. ➜ Click “OK” ➜ Click “Done”.
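To make the “Start With” / “End With” idea concrete, here is a rough Python sketch of the same kind of match. It is an assumption about what such a pattern looks like, not the exact expression the RegExp Tool generates. The helper name `match_between` and the sample inner HTML are hypothetical.

```python
import re


def match_between(source, start, end):
    """Return every piece of text found between `start` and `end`."""
    # re.escape treats both markers as literal text. (.*?) is non-greedy,
    # so each match stops at the nearest end marker. re.DOTALL lets a
    # match span line breaks.
    pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)
    return pattern.findall(source)


# Hypothetical inner HTML, invented for this example.
inner_html = (
    '<h3>Cloud Service</h3>'
    '<p style="">Run tasks around the clock.</p>'
    '<p style="">Export data in several formats.</p>'
)

# Returning all matches corresponds roughly to choosing "Match All".
matches = match_between(inner_html, '<p style="">', '</p>')
for text in matches:
    print(text)
```

Because the start marker is matched literally, a paragraph written as `<p>` or with a non-empty `style` attribute would be skipped. Check the “Source Text” to confirm which markers your page actually uses.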

 

Step 3. Extract the Customized Results

Click the “Field Name” to modify the name ➜ Click “Next” ➜ Click “Next” ➜ Click “Local Extraction” ➜ Click “OK” to run the task on your computer. Octoparse will automatically extract all the selected data.

 

Step 4. Export the Data

The extracted data will be shown in the “Data Extracted” pane. Click the “Export” button to export the results to a text file, an Excel file, or another format, and save the file to your computer.
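If you post-process the extracted text in your own script rather than using the “Export” button, a CSV file is one simple target that Excel can open. The sketch below is only an illustration: the helper `rows_to_csv`, the field name `description` and the sample rows are all hypothetical, and this is not how Octoparse itself exports data.

```python
import csv
import io


def rows_to_csv(rows, field_names):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=field_names, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical rows, one per extracted paragraph.
rows = [
    {"description": "Run tasks around the clock."},
    {"description": "Export data in several formats."},
]
csv_text = rows_to_csv(rows, ["description"])
print(csv_text)
```

To save the result, write `csv_text` to a file opened with `open("results.csv", "w", newline="", encoding="utf-8")`. The file name here is just a placeholder.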

 

 

 

 

 

Author: The Octoparse Team

 

 

 
