
Extract data from source code

Thursday, August 16, 2018

Question: What is source code?

Answer: Source code is the original text version of a web page, written in markup and scripting languages such as HTML, CSS, and JavaScript. So, it contains the information the page is built from. You can view the source code of any web page by right-clicking the page and selecting "View Page Source" in a browser.

 

Why do you need to scrape from source code?

When the data you need is shown as non-text content, such as a star rating, you may not be able to extract it with "Extract text of the element", because the numeric value is not visible on the page (only the stars are). You can still capture this information from the page's source code, the HTML. In other cases, the data you need may come out mixed with unrelated text when it is extracted directly as text; here, too, you can try scraping the data from the HTML.

Octoparse supports extracting data from source code directly. In this tutorial, we will show you how to extract from inner HTML and outer HTML.

 

1) Extract data from inner HTML

2) Extract data from outer HTML

3) Reformat data with RegEx tools

 

 

 

 

 

 

1) Extract data from inner HTML 

HTML is the standard markup language for creating web pages. When we extract the inner HTML of an element on the page, we get the HTML markup contained within that element. So, for information shown as a picture or icon, we can capture its inner HTML first, then extract the target data from that code using the data reformat tools.

Take the star-rating of a restaurant on Yelp.com as an example.

  • Click the "star-rating"
  • Select "Extract inner HTML of the selected element"


Switch to Workflow Mode by toggling the Workflow switch. The extracted inner HTML has been added to "Data field":

         <img class="offscreen" src="https://s3-media2.fl.yelpcdn.com/assets/srv0/yelp_design_web/9b34e39ccbeb/assets/img/stars/stars.png" alt="4.0 star rating" height="303" width="84">

Notice that the numeric value of the star rating (4.0) is included in the extracted code even though it is not directly visible on the web page. Now that we have the code, we can pinpoint "4.0" in it by reformatting the data with a Regular Expression (learn more about reformatting HTML in Part 3).
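For readers who want to check the idea outside Octoparse, here is a minimal Python sketch of the same extraction. It is not how Octoparse works internally; the variable names and the regular expression are assumptions chosen for illustration, applied to the inner HTML shown above.

```python
import re

# Inner HTML of the star-rating element, as shown in the example above.
inner_html = (
    '<img class="offscreen" '
    'src="https://s3-media2.fl.yelpcdn.com/assets/srv0/yelp_design_web/'
    '9b34e39ccbeb/assets/img/stars/stars.png" '
    'alt="4.0 star rating" height="303" width="84">'
)

# Capture the digits and dot that sit between alt=" and " star rating".
match = re.search(r'alt="([\d.]+) star rating"', inner_html)

# Convert to a number only if the pattern was found.
rating = float(match.group(1)) if match else None
print(rating)
```

The rating never appears as visible text on the page, yet it can be read from the `alt` attribute in the markup.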

 

 

 

 

 

2) Extract data from outer HTML 

Outer HTML is an element property that includes the opening and the closing tags as well as the content. So, capturing the outer HTML can technically provide more information than inner HTML. If the information needed cannot be found in the inner HTML, it is still possible to locate it in the outer HTML.

The steps to extract outer HTML are similar to those for inner HTML:

  • Click the data needed
  • Select "Extract outer HTML of the selected element" from "Action Tips"

 

The outer HTML of the star rating is as follows:

        <div style="background-color: rgb(229, 245, 233); outline: 1px solid rgb(0, 162, 59);" class="i-stars i-stars--large-4-rating-very-large" title="4.0 star rating">

        <img class="offscreen" src="https://s3-media2.fl.yelpcdn.com/assets/srv0/yelp_design_web/9b34e39ccbeb/assets/img/stars/stars.png" alt="4.0 star rating" height="303" width="84"> </div>

As you can see, the inner HTML (the <img> element) is part of the outer HTML. Once extracted, the target data (4.0) can be captured with the Regular Expression tool in the same way (see Part 3).
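To make the difference between inner and outer HTML concrete, the following Python sketch lists which attributes carry the rating in each case. It uses only the standard library's `html.parser`; the class and function names are hypothetical, and the markup is a shortened version of the Yelp example (the `style` attribute and the full image URL are omitted).

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Record (tag, attribute, value) for every attribute in the markup."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            self.found.append((tag, name, value))


def rating_sources(markup):
    """Return the (tag, attribute) pairs whose value mentions the rating."""
    parser = AttributeCollector()
    parser.feed(markup)
    parser.close()
    return [(tag, name) for tag, name, value in parser.found
            if value and "star rating" in value]


# Shortened versions of the snippets from the tutorial.
inner_html = ('<img class="offscreen" src="stars.png" '
              'alt="4.0 star rating" height="303" width="84">')
outer_html = ('<div class="i-stars i-stars--large-4-rating-very-large" '
              'title="4.0 star rating">' + inner_html + '</div>')

inner_sources = rating_sources(inner_html)
outer_sources = rating_sources(outer_html)
print(inner_sources)
print(outer_sources)
```

The outer HTML exposes the `title` attribute of the enclosing `div` in addition to the `alt` attribute of the `img`, which is why it can hold data that the inner HTML lacks.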

Tips!

1. How to extract the full HTML of a web page?

Extracting the full HTML gives you all the information on a web page, and it is easy to do.

  • Select any element in the page, then click the expansion icon at the bottom of "Action Tips"
  • Select "HTML" in the drop-down list
  • Select "Extract outer HTML of the selected element". Now you've captured the full HTML of the page!


 

2. Why is there no "Extract inner HTML ..." or "Extract outer HTML..." available on "Action Tips"? 

The options provided on "Action Tips" vary according to the data you select. Try expanding the selection by clicking the expansion icon at the bottom of "Action Tips".

 

 

 

3) Reformat data with RegEx tools

Data reformat tools help you process and clean the extracted data. There are 8 built-in data reformat tools in Octoparse. For the purpose of this tutorial, we'll cover two HTML-related reformat tools.

To access the data reformat tools:

  • Select the data field to reformat
  • Click the icon for customizing the field
  • Click "Refine extracted data"
  • Click "Add step"

 

 

1. HTML Transcoding

Once you have extracted the inner/outer HTML code, you can convert the HTML character entities in it into plain text using "HTML transcoding". For example, it transcodes "&gt;" into ">" and "&nbsp;" into a space.

  • Select "HTML transcoding"
  • Click "Evaluate" and confirm the output
  • Click "OK" to save the settings
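As a point of comparison, the same kind of decoding can be sketched in Python with the standard library's `html.unescape`. This is an illustration only, not Octoparse's implementation, and the sample string is made up for the example.

```python
import html

# A made-up sample in which tags and spaces are stored as HTML entities.
encoded = "&lt;b&gt;4.0&nbsp;star rating&lt;/b&gt;"

# Decode the entities back into the characters they stand for.
decoded = html.unescape(encoded)

# "&nbsp;" decodes to a non-breaking space (U+00A0); replace it with a
# regular space so later text matching behaves as expected.
plain = decoded.replace("\u00a0", " ")
print(plain)
```

Replacing the non-breaking space is a design choice for this sketch: it keeps later steps, such as matching on "star rating", from failing on an invisible character difference.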

 

2. Match with Regular Expression

  • Select "Match with Regular Expression"
  • Click "Try RegEx Tool"
  • Enter the match criteria: start with 'alt="', end with 'star rating'
  • Click "Generate", then "Match". You will see that the numeric value of the star rating (4.0) is matched.
  • Click "Apply" 
  • Click "OK" to save the settings
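The "start with / end with" criteria above can be approximated by a small Python helper. The function name `match_between` is hypothetical, and whether the RegEx Tool generates an equivalent pattern is an assumption; the sketch only shows one common way to turn two literal markers into a regular expression.

```python
import re


def match_between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    # re.escape treats both markers as literal text, so characters such as
    # quotes or dots in them are not read as regex syntax.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    found = re.search(pattern, text, flags=re.DOTALL)
    # Strip surrounding whitespace left over between the value and the marker.
    return found.group(1).strip() if found else None


# Shortened inner HTML from the star-rating example.
html_snippet = '<img class="offscreen" alt="4.0 star rating" height="303" width="84">'

rating = match_between(html_snippet, 'alt="', 'star rating')
missing = match_between(html_snippet, 'data-score="', '"')
print(rating)
print(missing)
```

The non-greedy `(.*?)` stops at the first occurrence of the end marker, which matters when the same marker appears more than once in the extracted code.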

 

Tips

If you are interested in learning about the other data reformat tools, see the "Data reformat tools" tutorial under Related Articles.

 

 

Related Articles:

Data reformat tools 

Case tutorial | scrape business information from Yelp.com 

Definition of source code in Wikipedia 

Learn more about HTML in W3schools  

Inner HTML 

Outer HTML 

 

 
