
Extract Content from a Web Page

Tuesday, January 19, 2021

Web scraping is a technique for collecting web content for your own use, and it is applied widely across industries. Freelance writers may extract online articles for topic research, while businesses of all sizes extract data from websites for business analysis. Here are some tips on how to get content from web pages.


How to get content from web pages 

For programmers and developers, Python is the most common way to build a web scraper/crawler to extract web content. For example, a short Python script can scrape data from a public website such as pokemondb.net.
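As a minimal sketch of what such a script looks like: real scrapers usually fetch the live page with a library like requests and parse it with BeautifulSoup, but the example below parses a small hand-made HTML sample (modeled on pokemondb.net's Pokédex table, not fetched from the live site) with only the standard library, so it runs offline:

```python
from html.parser import HTMLParser

# Hand-made sample HTML, modeled on a Pokedex-style table.
SAMPLE_HTML = """
<table id="pokedex">
  <tr><td>Bulbasaur</td><td>Grass</td></tr>
  <tr><td>Charmander</td><td>Fire</td></tr>
</table>
"""

class TableParser(HTMLParser):
    """Collects the text of each <td> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td and data.strip():
            self._row.append(data.strip())

parser = TableParser()
parser.feed(SAMPLE_HTML)
print(parser.rows)  # [['Bulbasaur', 'Grass'], ['Charmander', 'Fire']]
```

To scrape the real site, you would replace `SAMPLE_HTML` with the HTML returned by an HTTP request to the page.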


For people who do not have coding skills, it is easier to use a web content extractor to get specific content from web pages. Below are some solutions using Octoparse:


1. Extract content from a dynamic web page

Web pages can be either static or dynamic. It is often the case that the content you want to extract changes throughout the day, and such websites typically rely on AJAX, a technique that lets a page send and receive data in the background without interfering with the page display. In this case, you can enable the AJAX option so that Octoparse can extract content from dynamic web pages.


(Screenshot: checking the AJAX timeout setting in Octoparse)
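For readers who do write their own scrapers, the same idea applies in code: instead of scraping the rendered page, you can find the background AJAX request in the browser's developer tools (Network tab) and call its endpoint directly. A minimal sketch follows; the endpoint URL is hypothetical, and the stubbed `fetch` function stands in for a real HTTP call so the example runs offline:

```python
import json

# Hypothetical AJAX endpoint; in practice you discover the real URL
# in the browser's developer tools (Network tab).
AJAX_URL = "https://example.com/api/items?page=1"

def fetch(url):
    """Stub standing in for an HTTP GET (e.g. urllib.request.urlopen).
    Returns the JSON payload the AJAX endpoint would send back."""
    return json.dumps({"items": [{"title": "First post"},
                                 {"title": "Second post"}]})

payload = json.loads(fetch(AJAX_URL))
titles = [item["title"] for item in payload["items"]]
print(titles)  # ['First post', 'Second post']
```

Calling the JSON endpoint directly is often simpler and faster than rendering the whole page, because the data arrives already structured.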


2. Extract content that is hidden on the web page

Have you ever wanted to get specific data from a website, only to find that the content appears after you click a link or hover the mouse pointer over an element? For example, some contact information on craigslist.org appears only after you click the Reply button.



In fact, such hidden content can often be found in the HTML source code of the web page, and Octoparse can extract the text between the tags in the source code. It is easy to use the “Click Item” command or the “Cursor over” command under the “Action Tip” panel to trigger the action before extraction.
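The same observation can be exploited in a script: content that only appears after a click is frequently already present in the raw HTML, just hidden by CSS. A minimal sketch, using an invented HTML snippet (a regex pass is fragile for real HTML, but keeps the example short; an HTML parser is safer in practice):

```python
import re

# Hand-made sample: the email is present in the source but hidden
# until "Reply" is clicked (as on craigslist-style listings).
HTML = ('<a class="reply-button">Reply</a>'
        '<div class="reply-email" style="display:none">'
        'seller@example.com</div>')

# Pull the text of the hidden element straight out of the source.
match = re.search(r'class="reply-email"[^>]*>([^<]+)<', HTML)
hidden_text = match.group(1) if match else None
print(hidden_text)  # seller@example.com
```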


3. Extract content from a web page with infinite scrolling

You may also notice that some content loads only once you scroll to the bottom of the web page, as on Twitter. This is because the website applies infinite scroll, which usually relies on AJAX or JavaScript to issue new requests as you reach the end of the page. In this case, you can set the AJAX timeout and choose the scrolling method and number of scrolls to customize how the robot extracts the content.
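Under the hood, infinite scroll typically maps to a paginated background request: each scroll fetches page 2, page 3, and so on until the server returns nothing. The loop below sketches that pattern, with a stubbed `fetch_page` and invented data standing in for the real endpoint so the example runs offline:

```python
# Fake paginated responses; page 3 is empty, signaling the end of the feed.
fake_pages = {1: ["tweet 1", "tweet 2"], 2: ["tweet 3"], 3: []}

def fetch_page(page):
    """Stub for an HTTP call to a hypothetical ?page=N endpoint."""
    return fake_pages.get(page, [])

all_items, page = [], 1
while True:
    batch = fetch_page(page)
    if not batch:        # an empty page means no more content to "scroll" into
        break
    all_items.extend(batch)
    page += 1

print(all_items)  # ['tweet 1', 'tweet 2', 'tweet 3']
```

A scraping tool's "scrolling times" setting plays the same role as the loop's stopping condition: it bounds how many of these background requests are made.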


4. Extract hyperlinks from the web page

A typical website contains at least one hyperlink, and if you want to extract all the links from one web page, you can use Octoparse to extract all the URLs of the whole website.
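In code, this amounts to collecting every `href` attribute in the page. A minimal sketch over an invented snippet (a quick regex pass; for production scraping an HTML parser is more robust):

```python
import re

HTML = ('<p><a href="https://example.com/a">A</a> '
        "<a href='/relative/b'>B</a></p>")

# Grab the value of every href attribute, single- or double-quoted.
links = re.findall(r'href=["\']([^"\']+)["\']', HTML)
print(links)  # ['https://example.com/a', '/relative/b']
```

Note that relative URLs like `/relative/b` usually need to be joined with the page's base URL (e.g. with `urllib.parse.urljoin`) before they can be fetched.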


5. Extract text from the web page

If you want to extract the content placed between HTML tags, such as a <DIV> tag or a <SPAN> tag, Octoparse enables you to extract all the text between those tags in the source code.
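The equivalent in code is to track which tags you are inside while walking the document, and keep only the text that falls within the tags you care about. A minimal standard-library sketch over an invented snippet:

```python
from html.parser import HTMLParser

HTML = '<div>Hello <span>world</span></div><p>skipped</p>'

class TextParser(HTMLParser):
    """Keeps only text that sits inside <div> or <span> tags."""
    def __init__(self):
        super().__init__()
        self.depth, self.texts = 0, []

    def handle_starttag(self, tag, attrs):
        if tag in ("div", "span"):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in ("div", "span"):
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.texts.append(data.strip())

p = TextParser()
p.feed(HTML)
print(p.texts)  # ['Hello', 'world']
```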


6. Extract image URLs from the web page

Octoparse cannot download the image itself, but it can extract the URL of the image.
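Extracting image URLs in code means collecting the `src` attribute of every `<img>` tag; the images themselves can then be downloaded separately with an HTTP client. A minimal standard-library sketch over an invented snippet:

```python
from html.parser import HTMLParser

HTML = ('<img src="https://example.com/pic1.png" alt="one">'
        '<img src="/static/pic2.jpg">')

class ImgParser(HTMLParser):
    """Collects the src attribute of every <img> tag."""
    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.srcs.extend(v for k, v in attrs if k == "src")

p = ImgParser()
p.feed(HTML)
print(p.srcs)  # ['https://example.com/pic1.png', '/static/pic2.jpg']
```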






Octoparse can extract anything displayed on a web page and export it to structured formats such as Excel, CSV, HTML, and TXT, or to databases. However, Octoparse is currently unable to download images, videos, GIFs, or canvas content; we expect these functions to be added in a future version. Click HERE to download Octoparse and learn more from the rich tutorials.


Article in Spanish: Extraer Contenido de La Página Web

You can also read web scraping articles on the official website.

Article in German: Inhalt von Webseiten auslesen

You can also visit our German website.




Author: The Octoparse Team 


More Resources


Top 20 Web Scraping Tools to Scrape the Websites Quickly

Top 30 Big Data Tools for Data Analysis

Web Scraping Templates Take Away

How to Build a Web Crawler - A Guide for Beginners

Video: Create Your First Scraper with Octoparse 7.X

