Extract Text from HTML - Using RegExp Tool
Thursday, September 29, 2016 6:22 AM
(Download my extraction task of this tutorial HERE just in case you need it.)
You may wonder whether you can extract clean text from an HTML document, since it usually contains tags that you don’t need. You may also wonder whether you can scrape text that is hidden on the web page but present in its HTML source. Both are possible.
In this tutorial, I will use octoparse.com as an example to show you how to extract text from HTML effectively.
Step 1. Set up basic information
Choose “Advanced Mode”. ➜ Click “Start” ➜ Complete basic information. ➜ Click “Next” ➜ Enter the target URL in the built-in browser. ➜ Click the “Go” icon to open the webpage.
(URL of the example: http://www.octoparse.com/ )
Step 2. Extract the text you want from the HTML.
For example, suppose you want to extract text that stays hidden on the page until you move your cursor over a section (see the example below).
There are two methods to extract the text in such a case. The first one is to extract the text directly.
Choose the section you want. ➜ Click the highlighted link. ➜ Select “Extract text”. All the text in the section will be extracted.
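For readers who want to see what extracting the text directly amounts to outside Octoparse, here is a minimal Python sketch using the standard library's html.parser module. It is not Octoparse's implementation; the class name TextExtractor, the function extract_text, and the sample HTML are assumptions made for illustration only.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags; skip whitespace-only runs.
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML fragment, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical fragment standing in for the section selected on the page.
sample_html = (
    '<div class="feature"><h3>Cloud Service</h3>'
    '<p style="">Run tasks around the clock.</p></div>'
)
print(extract_text(sample_html))  # Cloud Service Run tasks around the clock.
```

Text that is hidden by styling is still present in the HTML source, so a parser of this kind returns it along with the visible text.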
The second one is to use our RegExp Tool.
Choose the section you want. ➜ Click the highlighted link. ➜ Click “Extract Inner HTML, including the page source code, text with format and images”.
Then you need to customize your data. ➜ Click “Customize Field” ➜ Click “Re-format extracted data” ➜ Click “Add step” ➜ Click “Match with Regular Expression”.
Click “Try RegEx Tool”. In the “Source Text”, you can see that the text begins with “<p style="">” and ends with “</p>”. ➜ Click “Start With” and paste “<p style="">” ➜ Click “End With” and paste “</p>” ➜ Click “Generate” ➜ Click “Match All” ➜ Click “Match” ➜ Click “Apply”.
The text will be shown in the “Output” browser. ➜ Click “OK” ➜ Click “Done”.
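The “Start With” / “End With” matching can be sketched in Python with the standard re module. The expression that the RegEx Tool actually generates is not shown in this tutorial, so the pattern below is only an assumed equivalent. The function name match_between and the sample inner HTML are hypothetical, and the start marker is assumed to be a paragraph tag with an empty style attribute.

```python
import re


def match_between(source, start, end):
    """Return every piece of text found between a start and an end marker."""
    # re.escape keeps characters such as < > " from being read as regex syntax.
    # The non-greedy (.*?) stops at the first end marker after each start marker.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    return [m.strip() for m in re.findall(pattern, source, flags=re.DOTALL)]


# Hypothetical inner HTML standing in for the "Source Text" of the tool.
inner_html = (
    '<h3>Title</h3>'
    '<p style="">First paragraph.</p>'
    '<p style="">Second paragraph.</p>'
)

matches = match_between(inner_html, '<p style="">', '</p>')
print(matches)  # ['First paragraph.', 'Second paragraph.']
```

Returning a list of all matches corresponds to “Match All”; taking only matches[0] would keep the first match alone.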
Step 3. Extract the Customized Results
Click the “Field Name” to modify the name ➜ Click “Next” ➜ Click “Next” ➜ Click “Local Extraction” ➜ Click “OK” to run the task on your computer. Octoparse will automatically extract all the selected data.
Step 4. Export the Data
The extracted data will be shown in the “Data Extracted” pane. Click the “Export” button to export the results to a text file, an Excel file, or another format, and save the file to your computer.
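If you post-process extracted text in your own script instead of using the “Export” button, a one-column CSV is a simple format that Excel can open. The sketch below uses Python's standard csv module. The function name rows_to_csv, the field name, and the sample values are assumptions for illustration and do not describe Octoparse's own export format.

```python
import csv
import io


def rows_to_csv(field_name, values):
    """Build CSV text with a header row and one extracted value per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([field_name])
    for value in values:
        # csv.writer quotes values that contain commas or quotes automatically.
        writer.writerow([value])
    return buffer.getvalue()


csv_text = rows_to_csv("description", ["First paragraph.", "Second, with a comma."])
print(csv_text)
```

To save the result, write csv_text to a file opened with newline="" so that line endings are not translated a second time.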
Author: The Octoparse Team