Extracting Stock Prices using Regular expression (Example: Finance.Yahoo.com)
Sunday, May 29, 2016 6:41 AM
Welcome to Octoparse’s tutorial.
We often run into this problem: the data we collect from a web page is not in the format we want. The collected values often contain extra characters or modified text, so we need to clean them. In Octoparse, we can use regular expressions to remove the unwanted characters and extract exactly what we want. It’s very convenient.
In this document, we want to extract some stock data for a company. The data values on the web page are comma-delimited. We need to remove the commas and spaces from these values as we collect them, so that the values are easier to process in further analysis. I’m going to show you how to use the regular expression tool to reformat the data. Let’s get started.
I’ll use a page on finance.yahoo.com as the example. The URL is http://finance.yahoo.com/q?s=%5Egsp
Set the basic information for the task and open the website in the built-in browser.
After the web page has loaded, extract the previous close, day's range, opening price and 52-week range, and give each field a name.
We will remove the commas from the first two fields, the previous close and day's range.
Choose the first field. Then click “Customize Field” -> “Re-format extracted data”.
We can see the original value and the final output here. Click “Add step” -> “Replace with Regular Expression”.
We’ll enter the regular expressions in the pop-up window.
We can directly enter a regular expression that matches a comma in the input box. Here we enter a comma and leave the “Replace with” box blank. Hit “Calculate” and the output is the value we want.
Or, we can enter a regular expression that matches every character except digits and the full stop, by listing those characters inside square brackets after a caret. Here we enter [^\d\.] and leave the “Replace with” box blank. Then hit “Calculate”, and we get the same value.
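To see what these two patterns do outside Octoparse, here is a minimal sketch using Python's re module. The sample value is made up for illustration and is not taken from the page, and Octoparse's own regular expression engine may differ in small details.

```python
import re

# Hypothetical sample value; the real "previous close" text on the page will differ.
previous_close = "2,090.10"

# Option 1: match the comma itself and replace it with nothing.
without_comma = re.sub(r",", "", previous_close)

# Option 2: remove every character that is not a digit or a full stop.
digits_and_dot = re.sub(r"[^\d\.]", "", previous_close)

print(without_comma)
print(digits_and_dot)
```

Both options give the same result for this value. The second one is stricter, because it also strips any other stray character, such as a currency sign or a space.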
You can try our regular expression tool here.
Then click OK, and click done.
Choose the second field. Then click “Customize Field” -> “Re-format extracted data” -> “Add step” -> “Replace with Regular Expression”.
Similarly, we can directly enter a regular expression that matches all the commas and spaces in the input box. Here we enter ,|\s and leave the “Replace with” box blank. Hit “Calculate”, and the commas and spaces are removed.
Or, we can enter a regular expression that matches every character except digits, the hyphen and the full stop, by listing those characters inside square brackets after a caret. Here we enter [^\d\.-] and leave the “Replace with” box blank. Then hit “Calculate”, and we get the same output.
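The same two patterns can be tried on a range value with Python's re module. This is only a sketch: the sample range below is an assumed value in the "low - high" layout, not real data from the page.

```python
import re

# Hypothetical sample value in the "low - high" layout of the day's range field.
day_range = "2,090.10 - 2,099.06"

# Option 1: match a comma OR any whitespace character and replace it with nothing.
no_commas_or_spaces = re.sub(r",|\s", "", day_range)

# Option 2: remove every character that is not a digit, a full stop or a hyphen.
# The hyphen is placed last inside the brackets so it is read literally,
# not as a character range.
allowed_only = re.sub(r"[^\d\.-]", "", day_range)

print(no_commas_or_spaces)
print(allowed_only)
```

Both patterns keep the hyphen, so the low and high values stay separated in the cleaned output.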
Click OK, and click done.
Looking at the first two fields, we can see that the commas and spaces have been removed from the values.
Click Save, and we can now extract the data.
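Once the commas and spaces are gone, the values can be read as numbers for further analysis. The sketch below assumes cleaned strings like the ones produced above; the helper name parse_range and the sample values are hypothetical, and it only handles positive prices separated by a single hyphen.

```python
def parse_range(cleaned):
    """Split a cleaned 'low-high' string into two floats (positive values only)."""
    low_text, high_text = cleaned.split("-")
    return float(low_text), float(high_text)


# Hypothetical cleaned values, shaped like the output of the steps above.
previous_close = float("2090.10")
low, high = parse_range("2090.10-2099.06")

print(previous_close, low, high)
```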
If this video tutorial is not available to you, you can click here to see the corresponding graphic tutorial.