Extract Content from Web Page

Friday, June 28, 2019

Web scraping (or crawling) is the process of extracting specific content from a website without using an API to obtain that content.

 

How to build a crawler: 

For programmers and developers, writing a script in Python is a common way to build a web scraper or crawler to extract web content. For example, the code in the screenshot below can be used to scrape data from a public website, pokemondb.net.

 

 (picture from /gist.github.com/anchetaWern/6150297)
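
As a rough illustration of what such a scraper involves, here is a minimal sketch that uses only the Python standard library. It is not the code from the gist: the class name TableRowParser, the helper scrape_rows and the sample HTML are assumptions made for this example, and a real run would first download the page (for instance with urllib.request.urlopen).

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:               # skip header rows that hold only <th>
                self.rows.append(self._row)
            self._row = None


def scrape_rows(html):
    """Return the table rows found in an HTML string."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample page used instead of a live download.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Type</th></tr>
  <tr><td><a href="#">Charmander</a></td><td>Fire</td></tr>
  <tr><td><a href="#">Squirtle</a></td><td>Water</td></tr>
</table>
"""

for name, kind in scrape_rows(SAMPLE_HTML):
    print(name, kind)
```

Even this small example needs some knowledge of HTML structure, which is why a visual tool is often easier for non-programmers.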

 

For people who do not have coding skills, it is usually easier to use a web content extractor to get specific content from web pages. Below are some solutions using Octoparse:

 

1. Extract content from the dynamic web page

Web pages can be either static or dynamic. The content you want to extract often changes throughout the day, and in many cases the website uses AJAX to update it. AJAX allows the web page to send and receive data in the background without reloading the page or interfering with its display. In this case, you can check the AJAX option to allow Octoparse to extract content from dynamic web pages.

 

 Check the AJAX timeout setting in Octoparse
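
The idea behind a timeout for AJAX content can be sketched as a polling loop: keep checking for the data until it shows up or the time limit runs out. This is only an illustration of the concept, not how Octoparse implements its setting; wait_for_content, FakePage and the sample prices are all hypothetical names and values.

```python
import time


def wait_for_content(fetch, is_ready, timeout=5.0, interval=0.5):
    """Call fetch() repeatedly until is_ready(result) is true.

    Returns the first ready result, or None if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if is_ready(result):
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


class FakePage:
    """Hypothetical stand-in for a page whose data arrives after a few checks."""

    def __init__(self, ready_after):
        self.checks = 0
        self.ready_after = ready_after

    def read_prices(self):
        self.checks += 1
        if self.checks >= self.ready_after:
            return ["19.99", "24.50"]
        return []   # the background request has not finished yet


page = FakePage(ready_after=3)
prices = wait_for_content(page.read_prices, bool, timeout=2.0, interval=0.1)
print(prices)
```

A longer timeout makes the extraction slower but less likely to miss content that loads late.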

 

2. Extract content that is hidden from the web page

Have you ever wanted to get specific data from a website, only to find that the content appears only after you click a link or hover the mouse pointer over an element? For example, some contact information on craigslist.org appears only after you click the Reply button.

 

 

In fact, such hidden content can often be found in the HTML source code of the web page, and Octoparse can extract the text from that source code. To reveal and extract it, use the “Click Item” command or the “Cursor over” command under the “Action Tip” panel.
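
To show what "hidden in the source code" can mean, the sketch below collects text from elements that are present in the HTML but not displayed. It assumes the page hides content with the hidden attribute or an inline display:none style, which is only one of several ways sites do this; the class name, the sample markup and the e-mail address are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def is_hidden(attrs):
    """True if the attribute list marks an element as not displayed."""
    attrs = dict(attrs)
    style = (attrs.get("style") or "").replace(" ", "").lower()
    return "hidden" in attrs or "display:none" in style


class HiddenTextParser(HTMLParser):
    """Collect text found inside elements that are hidden in the source."""

    def __init__(self):
        super().__init__()
        self.hidden_text = []
        self._depth = 0   # greater than 0 while inside a hidden element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth or is_hidden(attrs):
            self._depth += 1

    def handle_endtag(self, tag):
        if self._depth and tag not in VOID_TAGS:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth and data.strip():
            self.hidden_text.append(data.strip())


# Hypothetical listing page with a contact block that is not displayed.
SAMPLE_HTML = """
<div class="posting">
  <button>Reply</button>
  <div class="contact" style="display: none">
    <span>Email:</span> <a href="mailto:seller@example.com">seller@example.com</a>
  </div>
  <p>Visible description</p>
</div>
"""

parser = HiddenTextParser()
parser.feed(SAMPLE_HTML)
print(parser.hidden_text)
```

Content that is fetched only after the click will not be in the initial source, which is why simulating the click or hover is the more general approach.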

 

 

3. Extract content from the web page with infinite scrolling

You may also notice that on some websites, such as Twitter, more messages are loaded only once you scroll to the bottom of the page. This is because the website uses infinite scrolling, which usually relies on AJAX or JavaScript to request more content as you reach the end of the page. In this case, you can set the AJAX timeout and select the scrolling method and the number of scrolls to customize how you want the robot to extract the content.

 

Check the "Scroll Down" option in Octoparse to extract content.
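
The role of the "number of scrolls" setting can be sketched as a loop that keeps asking for the next batch of items until nothing more arrives or the limit is reached. This is an illustration under simple assumptions, not Octoparse's implementation: collect_with_scrolling, fake_load_more and the 25-post feed are hypothetical.

```python
def collect_with_scrolling(load_more, max_scrolls=10):
    """Call load_more(offset) up to max_scrolls times and gather the items.

    load_more stands in for whatever the page does when the reader reaches
    the bottom: it returns the next batch, or an empty list when done.
    """
    items = []
    for _ in range(max_scrolls):
        batch = load_more(len(items))
        if not batch:
            break               # nothing new was loaded, stop scrolling
        items.extend(batch)
    return items


# Hypothetical feed of 25 posts, served 10 at a time.
FEED = [f"post {i}" for i in range(1, 26)]


def fake_load_more(offset, page_size=10):
    return FEED[offset:offset + page_size]


print(len(collect_with_scrolling(fake_load_more)))
print(len(collect_with_scrolling(fake_load_more, max_scrolls=2)))
```

Setting the limit too low truncates the results, while setting it very high only costs time once the page has run out of content.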

 

4. Extract hyperlinks from the web page

Most websites contain at least one hyperlink. If you want to extract all the links from a web page, you can use Octoparse to extract all of its URLs.
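
For comparison, link extraction in code comes down to reading the href attribute of every <a> tag and resolving relative paths against the page address. The sketch below uses the Python standard library; LinkParser, extract_links and the example.com addresses are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Turn relative paths such as "/blog" into full URLs.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content and address.
SAMPLE_HTML = (
    '<a href="/blog">Blog</a> '
    '<a href="https://example.org/docs">Docs</a> '
    '<a name="top">anchor without a link</a>'
)

print(extract_links(SAMPLE_HTML, "https://example.com/index.html"))
```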

 

5. Extract text from the web page

If you want to extract the content placed between HTML tags, such as a <div> or <span> tag, Octoparse enables you to extract all the text between those tags in the source code.
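
A minimal sketch of the same idea in Python: gather the text inside each element with a given tag name, ignoring any markup nested within it. TagTextParser, text_between and the sample HTML are hypothetical, and the sketch assumes reasonably well-formed markup.

```python
from html.parser import HTMLParser


class TagTextParser(HTMLParser):
    """Collect the text inside each outermost element with the given tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.texts = []
        self._depth = 0     # nesting level of the target tag
        self._parts = []    # text fragments of the element being read

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Join the fragments and collapse runs of whitespace.
                self.texts.append(" ".join("".join(self._parts).split()))
                self._parts = []

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)


def text_between(html, tag):
    parser = TagTextParser(tag)
    parser.feed(html)
    return parser.texts


# Hypothetical product snippet.
SAMPLE_HTML = "<div>Price: <span>$12</span></div><div>In <b>stock</b></div><p>ignored</p>"

print(text_between(SAMPLE_HTML, "div"))
print(text_between(SAMPLE_HTML, "span"))
```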

 

6. Extract image URLs from the web page

Octoparse cannot download the image itself, but it can extract the URL of the image.
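
Extracting image URLs rather than the images themselves looks roughly like this in Python: read the src attribute of each <img> tag and resolve it against the page address. ImageUrlParser, extract_image_urls and the example.com paths are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageUrlParser(HTMLParser):
    """Collect the absolute URL of every <img src="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <img ... /> tags also arrive here.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.urls.append(urljoin(self.base_url, src))


def extract_image_urls(html, base_url):
    parser = ImageUrlParser(base_url)
    parser.feed(html)
    return parser.urls


# Hypothetical gallery markup and page address.
SAMPLE_HTML = '<img src="/img/a.png" alt="A"><img src="thumbs/b.jpg"/><img alt="no source">'

print(extract_image_urls(SAMPLE_HTML, "https://example.com/gallery/page.html"))
```

The resulting list of URLs can then be passed to a separate download tool if the image files are needed.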

 

 

 

Conclusion

Octoparse can extract anything displayed on the web page and export it to structured formats such as Excel, CSV, HTML, and TXT, as well as to databases. However, Octoparse is currently not able to download images, videos, GIFs, or canvas content. We expect these functions to be added in a future version. Click HERE to download Octoparse and learn more from its tutorials.

 


 

 

Author: The Octoparse Team 


 
