Extract Content from Web Page

2/28/2017 5:03:42 AM

Web scraping (or crawling) is the process of extracting specific content from a website that offers no API for obtaining that content. The legality of web scraping is still in flux, so read the terms of service of the website you want to extract content from before you start, to avoid unintentionally doing something illegal.

 

Let’s first define what web content is before we dive in. According to Wikipedia:

 

Web content is the textual, visual, or aural content that is encountered as part of the user experience on websites. It may include—among other things—text, images, sounds, videos, and animations.

In Information Architecture for the World Wide Web, Lou Rosenfeld and Peter Morville write, "We define content broadly as 'the stuff in your Web site.' This may include documents, data, applications, e-services, images, audio and video files, personal Web pages, archived e-mail messages, and more. And we include future stuff as well as present stuff."

 

For programmers, programming languages and their packages make it easier and more fun to build a web scraper or crawler that extracts the content described above from web pages. For example, the code in the screenshot below can be used to scrape data from a public website, pokemondb.net.

 

(picture from gist.github.com/anchetaWern/6150297)
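To give a rough idea of what such a scraper does, here is a minimal sketch in Python using only the standard library. It is not the code from the screenshot above. The sample markup, the `TableRowParser` class, and the `extract_rows` function are all hypothetical names made up for this illustration, and a real scraper would first download the page rather than read it from a string.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Type</th></tr>
  <tr><td>Bulbasaur</td><td>Grass</td></tr>
  <tr><td>Charmander</td><td>Fire</td></tr>
</table>
"""


def extract_rows(html):
    """Return the table rows found in an HTML string as lists of cell text."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

The parser only reacts to row and cell tags, so any other markup on the page is ignored.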

 

For people without coding knowledge, a web content extractor is a better way to get specific content from web pages. Below are some ways to use Octoparse to:

 

1. Extract content from a dynamic web page.

Web pages can be either static or dynamic. Often the content you want to extract changes depending on when you access the website. Such a website is usually dynamic: it uses AJAX or similar techniques to update the content of its pages over time. In this case, you can check the AJAX option to allow Octoparse to extract content from dynamic web pages.

 Check the AJAX timeout setting in Octoparse

 

2. Extract content that is hidden on the web page.

Have you ever wanted to get specific data from a website, only to find that the content appears only after you click a link or hover the mouse over something? For example, some contact information on craigslist.org appears after you click the Reply button.

 

 

The information behind the reply button on Craigslist

 

In fact, this kind of previously hidden content can be found in the HTML source code of the web page, and Octoparse can extract all the text in the source code. It’s easy to use a “Click Item” action or a “Cursor over” action to extract this kind of content from web pages.
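For readers who write code, the idea that hidden content already sits in the HTML source can be sketched as follows. This is a simplified illustration, not how Octoparse works internally. It assumes the hidden block is marked with a `hidden` attribute or an inline `display: none` style, and the sample markup, `HiddenTextParser`, and `extract_hidden_text` are hypothetical names. Content that the site only downloads after the click would not be found this way.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HiddenTextParser(HTMLParser):
    """Collect text that sits inside elements hidden via attribute or inline style."""

    def __init__(self):
        super().__init__()
        self.hidden_text = []
        self._depth = 0  # greater than 0 while inside a hidden element

    @staticmethod
    def _is_hidden(attrs):
        attrs = dict(attrs)
        style = (attrs.get("style") or "").replace(" ", "").lower()
        return "hidden" in attrs or "display:none" in style

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth or self._is_hidden(attrs):
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._depth and text:
            self.hidden_text.append(text)


# Hypothetical markup: the contact details exist in the source but are not displayed.
SAMPLE_HTML = """
<div class="posting">
  <button>Reply</button>
  <div class="reply-info" style="display: none">
    <span>Email: seller@example.com</span>
    <br>
    <span>Phone: 555-0100</span>
  </div>
  <p>Visible description</p>
</div>
"""


def extract_hidden_text(html):
    """Return the text fragments found inside hidden elements."""
    parser = HiddenTextParser()
    parser.feed(html)
    parser.close()
    return parser.hidden_text


hidden = extract_hidden_text(SAMPLE_HTML)
print(hidden)
```

The depth counter assumes reasonably well-formed markup; pages with unclosed tags would need a more forgiving parser.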

 

3. Extract content from a web page with infinite scrolling.

Sometimes it’s difficult to extract content that appears only as you keep scrolling toward the bottom of the web page. On imgur.com, for example, you need to keep scrolling to the bottom of the page to load more images. Websites with infinite scrolling usually use AJAX or JavaScript to request extra content from the server when you reach the bottom of the page or scroll down one screen. In this case, you can set the AJAX timeout, then select the scrolling method and the number of scrolls, to extract the content from the web pages.

 Check the "Scroll Down" option in Octoparse to extract content

 

 

 Check the AJAX timeout setting in Octoparse
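In code, the "number of scrolls" idea can be sketched as a loop that asks for one more batch of content per scroll and stops at a limit or when nothing more comes back. This is a minimal simulation under stated assumptions: `FAKE_SITE`, `BATCH_SIZE`, `fake_fetch_batch`, and `collect_scrolled_items` are hypothetical names, and the fake fetch function stands in for the request a real page would send to its server on each scroll.

```python
from typing import Callable, List


def collect_scrolled_items(fetch_batch: Callable[[int], List[str]],
                           max_scrolls: int) -> List[str]:
    """Request one batch per simulated scroll until the limit or an empty batch."""
    items: List[str] = []
    for scroll_index in range(max_scrolls):
        batch = fetch_batch(scroll_index)
        if not batch:  # nothing more to load
            break
        items.extend(batch)
    return items


# Hypothetical stand-in for a site that serves 10 items per scroll.
FAKE_SITE = [f"image-{n}.jpg" for n in range(25)]
BATCH_SIZE = 10


def fake_fetch_batch(scroll_index: int) -> List[str]:
    """Return the items a page would load on the given scroll (simulated)."""
    start = scroll_index * BATCH_SIZE
    return FAKE_SITE[start:start + BATCH_SIZE]


first_two_scrolls = collect_scrolled_items(fake_fetch_batch, max_scrolls=2)
everything = collect_scrolled_items(fake_fetch_batch, max_scrolls=10)
print(len(first_two_scrolls), len(everything))
```

A low scroll limit returns only part of the content, which is why the number of scrolls has to be chosen to match how much of the page you need.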

 

4. Extract all the links from a web page.

A typical web page contains at least one hyperlink. If you want to extract all the links from a web page, you can use Octoparse to get every hyperlink on it.
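For comparison, a coded version of link extraction might look like the sketch below, which uses Python's standard `html.parser` and `urllib.parse.urljoin`. The sample markup, the base URL, and the names `LinkParser` and `extract_links` are assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors that have no href
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return absolute URLs for all hyperlinks in an HTML string."""
    parser = LinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample page and address.
SAMPLE_HTML = """
<a href="/about">About</a>
<a href="page2.html">Next</a>
<a href="https://example.org/x">External</a>
<a name="top">No link here</a>
"""

links = extract_links(SAMPLE_HTML, "https://example.com/index.html")
for link in links:
    print(link)
```

Resolving each `href` against the page address turns relative links into full URLs that can be visited later.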

 

5. Extract all the text from a web page.

Occasionally you need to extract all the text in the HTML document, that is, the content placed between HTML tags such as <div> or <span>. Octoparse enables you to extract all the text inside the source code of the web page.
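As a rough illustration of what "all the text between the tags" means, the sketch below collects the text nodes of an HTML string with Python's standard `html.parser`. Skipping `<script>` and `<style>` content is a choice made for this example, and the sample markup, `TextParser`, and `extract_text` are hypothetical names.

```python
from html.parser import HTMLParser


class TextParser(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def extract_text(html):
    """Return the text content of an HTML string, joined by single spaces."""
    parser = TextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical sample page.
SAMPLE_HTML = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><div>Hello <span>world</span></div>"
    "<script>var x = 1;</script><p>Bye</p></body></html>"
)

text = extract_text(SAMPLE_HTML)
print(text)
```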

 

6. Extract all the images from a web page.

There is a great need for extracting images from web pages. Octoparse cannot extract the image itself, only the URL of the image.
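Collecting image URLs, rather than the images themselves, can be sketched in Python as follows. The sample markup, the base URL, and the names `ImageParser` and `extract_image_urls` are assumptions made for this example, not part of Octoparse.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageParser(HTMLParser):
    """Collect the src of every <img> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:  # skip images without a src attribute
                self.image_urls.append(urljoin(self.base_url, src))


def extract_image_urls(html, base_url):
    """Return absolute URLs for all images in an HTML string."""
    parser = ImageParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.image_urls


# Hypothetical sample page and address.
SAMPLE_HTML = """
<img src="/img/a.png">
<img src="photos/b.jpg" alt="b">
<img alt="no source">
"""

image_urls = extract_image_urls(
    SAMPLE_HTML, "https://example.com/gallery/index.html"
)
for url in image_urls:
    print(url)
```

With the URLs in hand, the image files could then be fetched in a separate step, for example with `urllib.request.urlretrieve`.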

 

Conclusion

Octoparse can extract almost all the content of a web page except video, Flash, and canvas. Download Octoparse to learn more from its tutorials.

 

 

Author: The Octoparse Team

 

 

 
