Web Scraping Tutorials: Scraping Source Code from Web Pages
Thursday, March 9, 2017 8:50 PM
In this tutorial, we will show you how to use Octoparse to extract page-level data, including webpage URL, page title, meta description, meta keywords, and HTML source code.
How to add the data?
1. Click on the "Extract Data" action
2. Go to the "Data Preview"
3. Click the add-field icon to add data field(s)
4. Hover on "Page-level data" to select the information that you want
The selected page-level data will be added as a field automatically to this "Extract Data" action.
5. Rename the data field as needed by double-clicking on the field name
Meaning of the fields
- Page URL: add the URL of the current page alongside the data extracted from that page
This is useful for tracing missing data fields back to the page they came from (see "What to do with those blank fields I got in the extracted result?").
- Page title: scrape the content of the title tag in the HTML
It is a short description of a webpage and appears at the top of a browser window.
- Meta description: scrape the content of the meta description tag
The tag contains a summary of the page content.
- Meta keywords: scrape the content of the meta keywords tag
Scraping the page title, meta description, and meta keywords is useful when users need to improve their SEO.
- HTML source code: the complete HTML code of the web page
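To make the fields above more concrete, here is a minimal Python sketch of where each one lives in a page's HTML. It is only an illustration built on Python's standard html.parser module, not how Octoparse works internally. The names PageLevelParser and page_level_data and the output keys are hypothetical and chosen just for this example.

```python
from html.parser import HTMLParser


class PageLevelParser(HTMLParser):
    """Hypothetical helper: collects the <title> text and two meta tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            # <meta name="description" content="..."> and
            # <meta name="keywords" content="...">
            name = (attr.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.meta[name] = attr.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text can arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def page_level_data(url, html):
    """Return the five page-level fields for an already-downloaded page."""
    parser = PageLevelParser()
    parser.feed(html)
    parser.close()
    return {
        "page_url": url,
        "page_title": parser.title.strip(),
        "meta_description": parser.meta.get("description", ""),
        "meta_keywords": parser.meta.get("keywords", ""),
        "html_source": html,
    }
```

Calling page_level_data with a page's URL and its HTML text returns one dictionary per page. A page with no meta description or meta keywords tag yields an empty string for that field, which is the kind of blank field that the Page URL column helps you trace back.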
If you need any help with task configuration or data collection, submit a ticket to our support team! We'll get back to you soon.
Happy Data Hunting!
Author: The Octoparse Team