Step-by-step tutorials for you to get started with web scraping
Extract data from source code
Monday, December 27, 2021
Psst! You are browsing a tutorial for Octoparse version 7.3, which is slowly on its way out. We strongly recommend that you update to the latest version 8.4 to enjoy all the exciting new features. You can also visit our new help center and check out the latest tutorials!
Question: What is source code?
Answer: Source code is the original text of a web page, written in markup and programming languages such as HTML, CSS, and JavaScript. It therefore holds the information used to build the page, including values that are never displayed as text. You can view the source code of any web page by right-clicking the page and selecting "View Page Source" in a browser.
When do you need to scrape from source code?
When the data you need is shown as non-text content, such as a star rating, you may not be able to extract it with "Extract text of the element" because the number value is not visible on the page (only the stars are). However, you can still capture this value from the page's source code, that is, its HTML. In other situations, the data you need may come out mixed with unwanted text when it is extracted directly as text; in that case, you can try scraping the data from the HTML instead.
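To make the difference concrete, here is a minimal Python sketch (outside Octoparse) using only the standard library. The markup in `SNIPPET` is a simplified, hypothetical stand-in for a star-rating widget, and the class name `TextAndAttrs` is ours. It shows that the element has no visible text while the rating is still present in its attributes.

```python
from html.parser import HTMLParser

# Hypothetical markup modeled on a star-rating widget: the rating lives
# only in attributes, so the element has no visible text.
SNIPPET = (
    '<div class="rating" title="4.0 star rating">'
    '<img alt="4.0 star rating">'
    '</div>'
)


class TextAndAttrs(HTMLParser):
    """Collects visible text and attribute values separately."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs from the opening tag
        self.attrs.extend((tag, name, value) for name, value in attrs)

    def handle_data(self, data):
        # Keep only non-blank text nodes
        if data.strip():
            self.text.append(data.strip())


parser = TextAndAttrs()
parser.feed(SNIPPET)
parser.close()

visible_text = " ".join(parser.text)
alt_values = [value for tag, name, value in parser.attrs if name == "alt"]

print(repr(visible_text))  # ''
print(alt_values)          # ['4.0 star rating']
```

Extracting the element's text returns nothing here, which is why the tutorial turns to the HTML itself.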
Octoparse supports extracting data from source code directly. In this tutorial, we will show you how to extract from inner HTML and outer HTML.
1) Extract data from inner HTML
2) Extract data from outer HTML
3) Reformat data with RegEx tools
1) Extract data from inner HTML
HTML is the standard markup language for creating web pages. When we extract the inner HTML of an element on the page, we get the HTML markup contained within that element. So, for information shown as a picture or icon, we can capture the inner HTML first and then extract the target data from that code using the data reformat tools.
Take the star rating of a restaurant on Yelp.com as an example.
- Locate "star-rating" on the webpage and click it
- Select "Extract inner HTML of the selected element" from the Action Tips panel
Switch to the Workflow Mode by toggling the Workflow switch. The extracted inner HTML has been added to the "Data field" section:
<img class="offscreen" src="https://s3-media2.fl.yelpcdn.com/assets/srv0/yelp_design_web/9b34e39ccbeb/assets/img/stars/stars.png" alt="4.0 star rating" height="303" width="84">
Notice that the number value of the star rating (4.0) is included in the extracted code even though it was not directly visible on the web page. Now that we have the code, we can pinpoint "4.0" by reformatting the data with a regular expression (learn more about reformatting HTML in Part 3).
2) Extract data from outer HTML
Outer HTML is an element property that includes the opening and the closing tags as well as the content. So, capturing the outer HTML can technically provide more information than inner HTML. If the information needed cannot be found in the inner HTML, it is still possible to locate it in the outer HTML.
The steps to extract outer HTML are similar to those for inner HTML:
- Click the data needed
- Select "Extract outer HTML of the selected element" from "Action Tips"
The outer HTML of the star rating is as follows:
<div style="background-color: rgb(229, 245, 233); outline: 1px solid rgb(0, 162, 59);" class="i-stars i-stars--large-4-rating-very-large" title="4.0 star rating">
<img class="offscreen" src="https://s3-media2.fl.yelpcdn.com/assets/srv0/yelp_design_web/9b34e39ccbeb/assets/img/stars/stars.png" alt="4.0 star rating" height="303" width="84">
</div>
As you can see, the inner HTML (the <img> element) is part of the outer HTML. Once extracted, the target data (4.0) can be captured with the Regular Expression tool in a similar way.
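The relationship between the two can be sketched in Python with the standard library's `xml.etree.ElementTree`. This is an illustration only, not how Octoparse works internally. It assumes a simplified, XML-well-formed version of the snippet (the <img> is self-closed, the `src` value is a placeholder, and the inline style is dropped) so that the parser accepts it.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed variant of the star-rating markup (assumption:
# self-closed <img>, placeholder src) so ElementTree can parse it.
OUTER = (
    '<div class="i-stars" title="4.0 star rating">'
    '<img class="offscreen" src="stars.png" alt="4.0 star rating" />'
    '</div>'
)

div = ET.fromstring(OUTER)

# Outer HTML: the element itself, including its opening and closing tags.
outer_html = ET.tostring(div, encoding="unicode")

# Inner HTML: only what sits between the element's own tags.
inner_html = (div.text or "") + "".join(
    ET.tostring(child, encoding="unicode") for child in div
)

print(outer_html)
print(inner_html)

# The title attribute belongs to the <div> tag, so only the outer HTML has it.
print("title=" in outer_html, "title=" in inner_html)  # True False
```

In this example the `title` attribute sits on the enclosing tag itself, which is the kind of information that only the outer HTML carries.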
Tips!
1. How to extract the full HTML of a web page? Extracting the full HTML gives you all the information on a web page, and it is easy to do.
2. Why is there no "Extract inner HTML ..." or "Extract outer HTML..." option available on "Action Tips"? The options provided on "Action Tips" vary according to the data you select. Try expanding the selection by clicking the expansion icon at the bottom of "Action Tips".
3) Reformat data with RegEx tools
Data reformat tools help you process and clean the extracted data. There are 8 built-in data reformat tools in Octoparse. For the purpose of this tutorial, we'll cover two HTML-related reformat tools.
To access the data reformat tools:
- Select the data field to reformat
- Click the icon to customize the field
- Click "Refine extracted data"
- Click "Add step"
1. HTML Transcoding
Once you have the inner/outer HTML code extracted, you can convert HTML entities into plain text using "HTML transcoding". For example, transcode "&gt;" into ">" and "&nbsp;" into a space.
- Select "HTML transcoding"
- Click "Evaluate" and confirm the output
- Click "OK" to save the settings
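For readers who want to see what this kind of decoding does outside Octoparse, here is a small Python sketch using the standard library's `html.unescape`. It is an analogy, not Octoparse's implementation, and the sample string `raw` is made up. Note that Python decodes `&nbsp;` to a non-breaking space (U+00A0), so the sketch swaps it for a regular space to mirror the behavior described above.

```python
import html

# Made-up sample of extracted HTML text containing entities
raw = "price&nbsp;&gt;&nbsp;10 &amp; rating &lt; 5"

# Decode entities such as &gt;, &lt;, &amp; and &nbsp;
decoded = html.unescape(raw)

# html.unescape maps &nbsp; to U+00A0; replace it with a plain space
plain = decoded.replace("\u00a0", " ")

print(plain)  # price > 10 & rating < 5
```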
2. Match with Regular Expression
- Select "Match with Regular Expression"
- Click "Try RegEx Tool"
- Enter the match criteria: start with alt=" and end with "star rating"
- Click "Generate", then "Match". You will see that the number value of the star rating (4.0) is matched.
- Click "Apply"
- Click "OK" to save the settings
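The same "start with / end with" idea can be sketched in Python with the `re` module. The helper name `match_between` is hypothetical, the `src` value is shortened to a placeholder, and the pattern built here is one reasonable way to express the criteria. The expression that Octoparse's RegEx Tool generates may look different.

```python
import re

# Inner HTML from Part 1, with the src URL shortened to a placeholder
inner_html = (
    '<img class="offscreen" src="stars.png" '
    'alt="4.0 star rating" height="303" width="84">'
)


def match_between(text, start, end):
    """Return the shortest text found between `start` and `end`, or None."""
    # re.escape treats both markers as literal text; (.*?) is non-greedy
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    found = re.search(pattern, text, flags=re.DOTALL)
    return found.group(1) if found else None


rating = match_between(inner_html, 'alt="', " star rating")
print(rating)  # 4.0
```

The end marker here includes the leading space so that the captured value is "4.0" without trailing whitespace.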
Tips! If you are interested in learning about the other data reformat tools, see this tutorial.