Extract Content from Web Page
Tuesday, February 28, 2017
Web scraping (or crawling) is the process of extracting specific content from a website when no API is available for obtaining that content. The legality of web scraping is still in flux, so read the terms of service of the website you want to extract content from before you start, to avoid unintentionally doing something illegal.
Let’s first define what web content is before we dive in. According to Wikipedia:
“Web content is the textual, visual, or aural content that is encountered as part of the user experience on websites. It may include—among other things—text, images, sounds, videos, and animations.
In Information Architecture for the World Wide Web, Lou Rosenfeld and Peter Morville write, "We define content broadly as 'the stuff in your Web site.' This may include documents, data, applications, e-services, images, audio and video files, personal Web pages, archived e-mail messages, and more. And we include future stuff as well as present stuff.""
For programmers, programming languages and their packages make it easier and more fun to build a web scraper or crawler that extracts the content described above from web pages. For example, the code in the screenshot below can be used to scrape data from a public website, pokemondb.net.
(picture from /gist.github.com/anchetaWern/6150297)
For most people without any coding knowledge, a web content extractor is a better way to get specific content from web pages. Below are some solutions that use Octoparse to:
1. Extract content from the dynamic web page.
Web pages can be either static or dynamic. Often the content you want to extract changes depending on when you access the website. Such a website is usually dynamic: it uses AJAX or similar techniques to keep the content of its pages up to date. In this case, you can check the AJAX option to allow Octoparse to extract content from dynamic web pages.
Check the AJAX timeout setting in Octoparse
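For readers who write their own scraper, the general idea behind a timeout on dynamic content can be sketched in Python: keep checking for the content until it shows up or a deadline passes. This is only an illustration of the waiting pattern, not a description of how Octoparse implements its AJAX option. The names `wait_for_content` and `read_content` are hypothetical, and `read_content` stands in for whatever code re-reads the page.

```python
import time
from typing import Callable, Optional


def wait_for_content(
    read_content: Callable[[], Optional[str]],
    timeout: float = 5.0,
    poll_interval: float = 0.1,
) -> Optional[str]:
    """Poll read_content() until it returns something or the timeout expires.

    read_content is a hypothetical callable supplied by the caller; it should
    return the loaded content, or None/"" while the page is still loading.
    """
    deadline = time.monotonic() + timeout
    while True:
        content = read_content()
        if content:
            return content
        if time.monotonic() >= deadline:
            return None  # gave up: the content never appeared in time
        time.sleep(poll_interval)
```

A longer timeout makes the scraper more tolerant of slow pages at the cost of waiting longer on pages that never load the content.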
2. Extract content that is hidden from the web page.
Have you ever wanted specific data from a website, only to find that the content appears after you click a link or hover the mouse over something? For example, some contact information on craigslist.org appears only after you click the Reply button.
The information behind the reply button on Craigslist
In fact, this kind of previously hidden content can often be found in the HTML source code of the web page, and Octoparse can extract all the text in the source code. It’s easy to use a “Click Item” action or a “Cursor over” action to extract this kind of content from web pages.
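For readers who prefer code, here is a minimal Python sketch of reading hidden text straight from the HTML source. It assumes the hidden block is already present in the source and is marked with the `hidden` attribute or an inline `display: none` style. Real sites may hide content in other ways, or load it only after the click, in which case this approach finds nothing. The class name `HiddenTextCollector` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HiddenTextCollector(HTMLParser):
    """Collect text found inside elements that are hidden via inline markup."""

    def __init__(self):
        super().__init__()
        self.hidden_text = []
        self._depth = 0  # greater than 0 while inside a hidden element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth > 0:
            self._depth += 1  # nested element inside a hidden block
            return
        attributes = dict(attrs)
        style = (attributes.get("style") or "").replace(" ", "").lower()
        if "hidden" in attributes or "display:none" in style:
            self._depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._depth > 0 and text:
            self.hidden_text.append(text)


def extract_hidden_text(html):
    parser = HiddenTextCollector()
    parser.feed(html)
    parser.close()
    return parser.hidden_text


sample = ('<div><p>Posted today</p>'
          '<div style="display: none"><span>call 555-0100</span></div></div>')
print(extract_hidden_text(sample))  # ['call 555-0100']
```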
3. Extract content from the web page with infinite scrolling.
Check the "Scroll Down" option in Octoparse to extract content.
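For those building their own scraper, it helps to know that infinite-scroll pages typically load more items in batches as the reader scrolls. A rough Python sketch of the same idea is to keep requesting the next batch until nothing more comes back. The function `fetch_batch` is a hypothetical stand-in for however a given site serves its next batch, and `fake_fetch` below only simulates it with an in-memory list.

```python
from typing import Callable, List


def collect_all_items(
    fetch_batch: Callable[[int, int], List[str]],
    batch_size: int = 20,
    max_batches: int = 50,
) -> List[str]:
    """Request successive batches until the feed runs out.

    fetch_batch(offset, limit) is a hypothetical callable returning up to
    `limit` items starting at `offset`. max_batches is a safety cap so a
    feed that never ends cannot keep the loop running forever.
    """
    items: List[str] = []
    for batch_index in range(max_batches):
        batch = fetch_batch(batch_index * batch_size, batch_size)
        if not batch:
            break
        items.extend(batch)
        if len(batch) < batch_size:  # a short batch means the feed has ended
            break
    return items


# Simulated feed of 45 posts, used only to exercise the loop.
FAKE_FEED = [f"post {i}" for i in range(45)]


def fake_fetch(offset, limit):
    return FAKE_FEED[offset:offset + limit]


print(len(collect_all_items(fake_fetch)))  # 45
```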
4. Extract all the links from the web page.
Almost every web page contains at least one hyperlink. If you want to extract all the links from a web page, you can use Octoparse to get every hyperlink posted on it.
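For comparison, the same task can be sketched in a few lines of Python using only the standard library. This example parses an HTML string that has already been downloaded, and the names `LinkCollector` and `extract_links` are illustrative. Relative links are resolved against the page address with `urljoin`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:  # skip anchors that have no href at all
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


sample = '<p><a href="/about">About</a> <a href="https://example.org/x">X</a></p>'
print(extract_links(sample, "https://example.com/index.html"))
# ['https://example.com/about', 'https://example.org/x']
```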
5. Extract all the text from the web page.
Occasionally you need to extract all the text in an HTML document, that is, the content placed between HTML tags such as the <DIV> or <SPAN> tag. Octoparse enables you to extract all the text inside the source code of the web page.
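A hand-rolled version of this can be sketched with Python's standard `html.parser` module. The example below gathers the text between tags and joins it with spaces. Skipping the contents of `<script>` and `<style>` is a choice made for this sketch, since that text is code rather than page content. The names `TextCollector` and `extract_text` are illustrative.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text between tags, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = "<div>Hello <span>world</span></div><script>var x = 1;</script>"
print(extract_text(sample))  # Hello world
```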
6. Extract all the images from the web page.
There is a great need for extracting images from web pages. Octoparse cannot extract the image file itself, only the URL of the image.
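Collecting image URLs is also easy to sketch in Python: read the `src` attribute of every `<img>` tag and resolve it against the page address. The names `ImageCollector` and `extract_image_urls` are illustrative. The sketch looks only at `src`, so images referenced through CSS or other attributes are missed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect the src of every <img> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:  # ignore <img> tags without a usable src
                self.image_urls.append(urljoin(self.base_url, src))


def extract_image_urls(html, base_url):
    parser = ImageCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.image_urls


sample = '<img src="/logo.png" alt="logo"><img src="pics/a.jpg"/>'
print(extract_image_urls(sample, "https://example.com/shop/"))
# ['https://example.com/logo.png', 'https://example.com/shop/pics/a.jpg']
```

With the URLs in hand, the image files can be downloaded in a separate step.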
Author: The Octoparse Team