Extract Text from HTML - Using RegExp Tool

Thursday, September 29, 2016 6:22 AM

(Download the extraction task for this tutorial HERE in case you need it.)

You may wonder whether you can extract text from an HTML document, since it usually contains tags that you don’t need. Or you may ask whether you can scrape text that is hidden on the rendered page but still present in the HTML source. Both are possible.

In this tutorial, I will use octoparse.com as an example to show you how to extract text from HTML.

Step 1. Set up basic information

Choose “Advanced Mode” ➜ Click “Start” ➜ Complete basic information ➜ Click “Next” ➜ Enter the target URL in the built-in browser ➜ Click the “Go” icon to open the webpage.

(URL of the example: http://www.octoparse.com/ )




Step 2. Extract the text you want from the HTML

For example, suppose you want to extract text that stays hidden on the page until you move your cursor over a section (see the example below).



There are two methods to extract the text in such a case. The first is to extract the text directly.

Choose the section you want ➜ Click the highlighted link ➜ Select “Extract text”. All the text in the section will be extracted.
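For readers who want to see what "extract the text directly" amounts to outside the tool, here is a minimal Python sketch using only the standard library. It is not how Octoparse implements “Extract text”; the class name `TextExtractor`, the function `extract_text`, and the sample HTML are assumptions made for illustration. It keeps the text nodes of a section and drops the tags, and because it reads the source rather than the rendered page, text hidden by CSS is still returned.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores the content of <script>/<style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def extract_text(html: str) -> str:
    """Return the text of an HTML fragment with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical section: the paragraph is hidden until hover on the real page.
sample = (
    '<div class="feature"><h3>Point and Click</h3>'
    '<p style="display:none">Shown on hover</p>'
    "<script>var x = 1;</script></div>"
)
print(extract_text(sample))
```

The parser never evaluates CSS, so the `display:none` paragraph is treated like any other text node.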

The second one is to use our RegExp Tool.

Choose the section you want. ➜ Click the highlighted link. ➜ Click “Extract Inner HTML, including the page source code, text with format and images”.

Then customize the extracted data: Click “Customize Field” ➜ Click “Re-format extracted data” ➜ Click “Add step” ➜ Click “Match with Regular Expression”.

Click “Try RegEx Tool”. In the “Source Text”, you can see that the text begins with “<p style="">” and ends with “</p>”. ➜ Click “Start With” and paste “<p style="">” ➜ Click “End With” and paste “</p>” ➜ Click “Generate” ➜ Click “Match All” ➜ Click “Match” ➜ Click “Apply”.

The text will be shown in the “Output” browser. ➜ Click “OK” ➜ Click “Done”.
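The “Start With” / “End With” idea can be sketched in Python as a non-greedy regular expression between two literal markers. This is only an approximation under stated assumptions: the exact pattern the RegEx Tool generates is not shown in this tutorial, and the helper `match_between` and the sample inner HTML are hypothetical.

```python
import re
from typing import List


def match_between(source: str, start: str, end: str) -> List[str]:
    """Return every piece of text found between `start` and `end`."""
    # re.escape treats the markers as literal text, not as regex syntax.
    # (.*?) is non-greedy, so each match stops at the nearest `end`.
    pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)
    return pattern.findall(source)


# Hypothetical inner HTML of the selected section.
inner_html = (
    '<div><p style="">Extract data from any website.</p>'
    '<p style="">No coding needed.</p></div>'
)

# Comparable to "Match All": every paragraph between the markers is returned.
matches = match_between(inner_html, '<p style="">', "</p>")
print(matches)  # ['Extract data from any website.', 'No coding needed.']
```

Taking only `matches[0]` would correspond to keeping the first match instead of all of them.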

Step 3. Extract the Customized Results

Click the “Field Name” to rename the field ➜ Click “Next” ➜ Click “Next” ➜ Click “Local Extraction” ➜ Click “OK” to run the task on your computer. Octoparse will automatically extract all the selected data.



Step 4. Export the Data

The extracted data will be shown in the “Data Extracted” pane. Click the “Export” button to export the results to a text file, an Excel file, or another format, and save the file to your computer.
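If you post-process matched text in a script rather than in Octoparse, a comparable export step might look like the sketch below. It uses Python's standard `csv` module; the field name `feature_text` and the helper `rows_to_csv` are assumptions for illustration and do not reflect Octoparse's own export format.

```python
import csv
import io
from typing import Dict, List


def rows_to_csv(rows: List[Dict[str, str]], fieldnames: List[str]) -> str:
    """Serialize extracted rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical extracted rows, one per matched paragraph.
rows = [
    {"feature_text": "Extract data from any website."},
    {"feature_text": "Fast, simple export"},
]
csv_text = rows_to_csv(rows, ["feature_text"])

# Write the result to disk; Excel and most spreadsheet tools can open it.
with open("extracted.csv", "w", newline="", encoding="utf-8") as handle:
    handle.write(csv_text)
```

Values that contain a comma are quoted automatically by the `csv` module, so the second row stays in a single column.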


Author: The Octoparse Team

