Extract Content from Web Page
Tuesday, January 19, 2021
Web scraping is a technique for collecting web content for your own use, and it is widely used across industries. Freelance writers may extract online articles for topic research, and businesses of all sizes extract data from websites for business analysis. Here are some tips on how to get content from web pages.
How to get content from web pages
For programmers and developers, writing a scraper or crawler in Python is a common way to extract web content. For example, a short Python script can be used to scrape data from a public website such as pokemondb.net.
For people who do not have coding skills, a web content extractor is usually a better way to get specific content from web pages. Below are some solutions using Octoparse:
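As a rough illustration of the Python approach, here is a minimal sketch that uses only the standard library. It assumes the data you want sits in an HTML table; the sample markup, the `parse_table_rows` and `fetch_html` names, and the User-Agent string are all made up for this example and say nothing about how pokemondb.net is actually structured.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_table_rows(html):
    """Return all table rows found in an HTML string."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def fetch_html(url):
    """Download a page. The User-Agent value is an arbitrary placeholder."""
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


# Hypothetical sample markup, standing in for a downloaded page.
SAMPLE = """
<table>
  <tr><th>Name</th><th>Type</th></tr>
  <tr><td>Charmander</td><td>Fire</td></tr>
  <tr><td>Squirtle</td><td>Water</td></tr>
</table>
"""

rows = parse_table_rows(SAMPLE)
for row in rows:
    print(row)
```

To try it on a live page, you could pass the result of `fetch_html(url)` to `parse_table_rows`, after checking that the site's terms allow scraping.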
1. Extract content from the dynamic web page
Web pages can be either static or dynamic. The content you want to extract often changes throughout the day, and in many cases the website uses AJAX to update it. AJAX allows the web page to send and receive data in the background without interfering with the page display. In this case, you can check the AJAX option to allow Octoparse to extract content from dynamic web pages.
Check the AJAX timeout setting in Octoparse
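To see why a timeout matters for AJAX content, here is a small Python sketch of the general idea: keep checking for the content until it arrives or the time runs out. This is not how Octoparse works internally; the `wait_for_content` function and the fake loader are hypothetical stand-ins.

```python
import time


def wait_for_content(load, timeout=5.0, interval=0.5):
    """Call load() repeatedly until it returns content or the timeout elapses.

    Returns the content, or None if nothing arrived in time.
    """
    deadline = time.monotonic() + timeout
    while True:
        content = load()
        if content:
            return content
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


# Fake loader: the content only shows up on the third check,
# as if a background request had just finished.
responses = iter(["", "", "<li>fresh item</li>"])
result = wait_for_content(lambda: next(responses), timeout=2.0, interval=0.01)

# A loader that never returns anything runs into the timeout.
timed_out = wait_for_content(lambda: "", timeout=0.05, interval=0.01)
```

A longer timeout gives slow pages more time to finish loading, at the cost of a slower extraction when the content never arrives.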
2. Extract content that is hidden from the web page
Have you ever wanted to get specific data from a website, only to find that the content appears after you click a link or hover the mouse pointer over an element? For example, some contact information on craigslist.org appears only after you click the Reply button.
In fact, such hidden content can be found in the HTML source code of the web page, and Octoparse can extract text from that source code. You can use the “Click Item” command or the “Cursor over” command under the “Action Tip” panel to reveal and extract it.
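For readers who code, the idea that hidden content still lives in the HTML source can be sketched in Python. The example assumes the content is hidden with the `hidden` attribute or an inline `display: none` style; pages that hide content through CSS classes or load it only after the click would need a different approach. The markup and the contact details are invented.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they cannot wrap hidden text.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


def is_hidden(attrs):
    """True for the `hidden` attribute or an inline `display: none` style."""
    style = (attrs.get("style") or "").replace(" ", "").lower()
    return "hidden" in attrs or "display:none" in style


class HiddenTextParser(HTMLParser):
    """Collect the text of elements that are present but not displayed."""

    def __init__(self):
        super().__init__()
        self.hidden_texts = []
        self._tag = None    # tag name of the hidden element being read
        self._depth = 0     # nesting depth of that same tag name
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if self._tag is None:
            if tag not in VOID_TAGS and is_hidden(dict(attrs)):
                self._tag, self._depth, self._parts = tag, 1, []
        elif tag == self._tag:
            self._depth += 1

    def handle_data(self, data):
        if self._tag is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                text = " ".join("".join(self._parts).split())
                if text:
                    self.hidden_texts.append(text)
                self._tag = None


# Hypothetical listing page with contact details hidden until "Reply".
SAMPLE = """
<div class="post">
  <button>Reply</button>
  <div class="contact" style="display: none">Email: seller@example.com</div>
  <span hidden>Phone: 555-0100</span>
</div>
"""

parser = HiddenTextParser()
parser.feed(SAMPLE)
hidden = parser.hidden_texts
```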
3. Extract content from the web page with infinite scrolling
You may also notice that on some websites, such as Twitter, more messages are loaded only once you scroll to the bottom of the page. This is because these websites use infinite scrolling, which usually relies on AJAX or JavaScript to send new requests as you reach the end of the page. In this case, you can set the AJAX timeout and select the scrolling method and the number of scrolls to customize how the robot extracts the content.
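The "number of scrolls" setting can be pictured as a simple loop. The Python sketch below is only a model of that idea: `collect_with_scrolling` and the fake feed are hypothetical, and each call to `load_batch` stands in for the request a page makes when you reach its bottom.

```python
def collect_with_scrolling(load_batch, scroll_times):
    """Request up to scroll_times batches, stopping early when one is empty."""
    items = []
    for scroll in range(scroll_times):
        batch = load_batch(scroll)
        if not batch:
            break  # nothing more to load, so further scrolling is pointless
        items.extend(batch)
    return items


# Fake feed of seven posts, served three at a time.
FEED = [f"post {i}" for i in range(1, 8)]


def fake_load(scroll, size=3):
    """Return the slice of the feed that the given scroll would reveal."""
    return FEED[scroll * size:(scroll + 1) * size]


two_scrolls = collect_with_scrolling(fake_load, 2)
many_scrolls = collect_with_scrolling(fake_load, 10)
```

With two scrolls only the first six posts are collected; with ten, the loop stops on its own once the feed runs out.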
4. Extract hyperlinks from the web page
Almost every web page contains at least one hyperlink. If you want to extract all the links, Octoparse can help you extract the URLs from a single page or from the whole website.
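If you prefer code, link extraction for a single page can be sketched with Python's standard library. The `LinkParser` class, the base URL, and the sample markup below are illustrative assumptions; relative links are resolved against the page address with `urljoin`, and duplicates are skipped.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                url = urljoin(self.base_url, href)
                if url not in self.links:  # keep the first occurrence only
                    self.links.append(url)


# Hypothetical page address and markup.
BASE_URL = "https://example.com/blog/"
SAMPLE = """
<a href="/about">About</a>
<a href="post-1.html">First post</a>
<a href="https://example.org/">Partner site</a>
<a href="/about">About (footer)</a>
"""

parser = LinkParser(BASE_URL)
parser.feed(SAMPLE)
links = parser.links
```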
5. Extract text from the web page
If you want to extract the content placed between HTML tags, such as a <DIV> or <SPAN> tag, Octoparse enables you to extract all the text between those tags.
6. Extract image URLs from the web page
Octoparse cannot download the image itself, but it can extract the URL of the image.
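The same "text between tags" idea can be sketched in Python. The `text_in_tags` helper and the sample markup are hypothetical; the parser lowercases tag names, so `<DIV>` and `<div>` are treated alike, and text inside nested tags is included.

```python
from html.parser import HTMLParser


class TagTextParser(HTMLParser):
    """Collect the text found inside every occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag.lower()
        self.texts = []
        self._depth = 0     # how many of the target tags are currently open
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self._depth += 1

    def handle_data(self, data):
        if self._depth > 0:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == self.tag and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Collapse whitespace and store one string per outer element.
                self.texts.append(" ".join("".join(self._parts).split()))
                self._parts = []


def text_in_tags(html, tag):
    """Return the text inside each outermost <tag> element."""
    parser = TagTextParser(tag)
    parser.feed(html)
    parser.close()
    return parser.texts


# Hypothetical product snippet.
SAMPLE = '<DIV class="price"><SPAN>$</SPAN> 19.99</DIV><div>In   stock</div>'

div_texts = text_in_tags(SAMPLE, "div")
span_texts = text_in_tags(SAMPLE, "span")
```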
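Collecting image URLs without downloading the files can be sketched in Python as follows. The `ImageParser` class, the base URL, and the markup are assumptions made for this example; only the `src` attribute is read, so images declared through `srcset` or CSS backgrounds would be missed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageParser(HTMLParser):
    """Collect the absolute URL of every <img src="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_urls.append(urljoin(self.base_url, src))


# Hypothetical page address and markup.
BASE_URL = "https://example.com/gallery/"
SAMPLE = """
<img src="cat.png" alt="cat">
<img src="/static/dog.jpg"/>
<img alt="no source">
"""

parser = ImageParser(BASE_URL)
parser.feed(SAMPLE)
image_urls = parser.image_urls
```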
Conclusion
Octoparse can extract anything displayed on the web page and export it to structured formats such as Excel, CSV, HTML, and TXT, as well as to databases. However, Octoparse is currently not able to download images, videos, GIFs, or canvas content. We expect these functions to be added in a future version. Click HERE to download Octoparse and learn more from its tutorials.
Author: The Octoparse Team