Web scraping (or crawling) is the process of extracting specific content from a website without using an API to obtain that content.
How to build a crawler:
For programmers and developers, Python is a common choice for building a web scraper or crawler to extract web content. For example, the code in the screenshot below can be used to scrape data from a public website, pokemondb.net.
(picture from /gist.github.com/anchetaWern/6150297)
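As a rough illustration of what such a script can look like, here is a minimal sketch that uses only the Python standard library. It is not the code from the gist credited above. The names TableRowParser, parse_rows, fetch_html and the SAMPLE markup are assumptions made for this example, and the parsing step is shown on a small inline table rather than on a live page.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_rows(html):
    """Return the table rows found in an HTML string."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def fetch_html(url):
    """Download a page as text (hypothetical helper, not called here)."""
    # A User-Agent header is set because some sites reject the default one.
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


# Made-up sample markup standing in for a downloaded page.
SAMPLE = (
    "<table><tr><th>Name</th><th>Type</th></tr>"
    "<tr><td>Bulbasaur</td><td>Grass</td></tr></table>"
)
rows = parse_rows(SAMPLE)
print(rows)
```

To run it against a real site, pass the result of fetch_html(url) to parse_rows, after checking that the site's terms allow scraping.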
For people who do not have coding skills, a web content extractor is usually a better way to get specific content from web pages. Below are some solutions using Octoparse:
1. Extract content from the dynamic web page
Web pages can be either static or dynamic. Often, the content you want to extract changes throughout the day, and the website uses AJAX to load it. AJAX allows the web page to send and receive data in the background without interfering with the page display. In this case, you can check the AJAX option to allow Octoparse to extract content from dynamic web pages.
Check the AJAX timeout setting in Octoparse
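For readers who code, the idea behind a timeout on background-loaded content can be sketched as a small polling loop: keep checking for the content until it shows up or the time limit passes. This is only an illustration of the concept, not a description of how Octoparse implements its AJAX setting. The function wait_for and its parameters are hypothetical, and the "page" is simulated with an iterator.

```python
import time


def wait_for(fetch, timeout=5.0, interval=0.5):
    """Call fetch() until it returns something non-empty or the timeout passes.

    Returns the fetched result, or None if nothing arrived in time.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


# Simulated page: the data is only available on the third check.
responses = iter([None, None, ["row 1", "row 2"]])
data = wait_for(lambda: next(responses), timeout=1.0, interval=0.01)
print(data)
```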
2. Extract content that is hidden from the web page
Have you ever wanted to get specific data from a website, only to find that the content appears only after you click a link or hover the mouse pointer over an element? For example, some contact information on craigslist.org appears after you click the Reply button.
In fact, such hidden content can often be found in the HTML source code of the web page, and Octoparse can extract the text from that source code. Use the “Click Item” command or the “Cursor over” command under the “Action Tip” panel to trigger the content and extract it.
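When hidden content really is present in the page source, a script can read it without clicking anything. The sketch below assumes the content is hidden with the hidden attribute or an inline display:none style, which is only one of several ways a site may hide it. HiddenTextParser and the sample markup are made up for this example, and the depth counter assumes well-formed HTML with matching closing tags.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class HiddenTextParser(HTMLParser):
    """Collect text that sits inside elements hidden from view."""

    def __init__(self):
        super().__init__()
        self.hidden_text = []
        self._depth = 0  # greater than 0 while inside a hidden element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        style = (attrs.get("style") or "").replace(" ", "").lower()
        if self._depth or "hidden" in attrs or "display:none" in style:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._depth and text:
            self.hidden_text.append(text)


# Made-up markup: one visible paragraph and two hidden pieces of text.
html = (
    "<div><p>Visible</p>"
    '<div style="display: none"><span>555-0100</span></div>'
    "<p hidden>secret</p></div>"
)
parser = HiddenTextParser()
parser.feed(html)
parser.close()
print(parser.hidden_text)
```

Content that a site fetches from the server only after the click will not be in the source at all, so this approach does not cover that case.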
3. Extract content from the web page with infinite scrolling
Check the "Scroll Down" option in Octoparse to extract content.
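In code, infinite scrolling is often handled by requesting one batch of items after another until nothing more comes back, since such pages typically load additional items in the background as the user scrolls. The sketch below assumes a hypothetical fetch_page(page_number) function that returns a list of items and an empty list at the end. The FAKE_SITE dictionary stands in for a real site.

```python
def collect_all(fetch_page, max_pages=100):
    """Gather items batch by batch until a batch comes back empty."""
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:
            break  # no more content to load
        items.extend(batch)
    return items


# Made-up data: two batches of items, then nothing.
FAKE_SITE = {1: ["item 1", "item 2"], 2: ["item 3"]}
all_items = collect_all(lambda page: FAKE_SITE.get(page, []))
print(all_items)
```

The max_pages limit is a safety cap so that a page that never runs out of content cannot keep the loop going forever.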
4. Extract hyperlinks from the web page
Almost every web page contains at least one hyperlink. If you want to extract all the links from a web page, you can use Octoparse to extract every URL on that page, or across the whole website.
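For comparison, link extraction in Python can be sketched with the standard library alone. LinkParser and extract_links are names chosen for this example, and the sample markup and example.com addresses are placeholders. Relative links are resolved against the page address with urljoin.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return all link URLs found in an HTML string."""
    parser = LinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up markup: a relative link, an absolute link, and an anchor without href.
page = (
    '<a href="/about">About</a>'
    '<a href="https://example.org/x">X</a>'
    '<a name="top">no href</a>'
)
links = extract_links(page, "https://example.com/index.html")
print(links)
```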
5. Extract text from the web page
If you want to extract the content placed between HTML tags such as <DIV> or <SPAN>, Octoparse enables you to extract all the text between those tags.
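The same idea can be sketched in Python: walk through the HTML and keep only the text that appears inside the chosen tags. TagTextParser and extract_text are hypothetical names for this example. The parser lowercases tag names, so <DIV> and <div> are treated alike.

```python
from html.parser import HTMLParser


class TagTextParser(HTMLParser):
    """Collect text that appears inside any of the given tags."""

    def __init__(self, tags=("div", "span")):
        super().__init__()
        self.tags = set(tags)
        self.texts = []
        self._open = 0  # number of matching tags currently open

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self._open += 1

    def handle_endtag(self, tag):
        if tag in self.tags and self._open:
            self._open -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._open and text:
            self.texts.append(text)


def extract_text(html, tags=("div", "span")):
    """Return the text chunks found inside the given tags."""
    parser = TagTextParser(tags)
    parser.feed(html)
    parser.close()
    return parser.texts


# Made-up markup: the <p> text is skipped, the <DIV> and <span> text is kept.
texts = extract_text("<p>skip</p><DIV>Price: <span>$10</span></DIV>")
print(texts)
```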
6. Extract image URLs from the web page
Octoparse cannot download the image itself, but it can extract the URL of the image.
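Collecting image URLs by script follows the same pattern: read the src attribute of each <img> tag. The sketch below also keeps the alt text, which can help label the results. ImageParser, extract_image_urls, and the example.com addresses are assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageParser(HTMLParser):
    """Collect (absolute URL, alt text) pairs for every <img> with a src."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attrs = dict(attrs)
            src = attrs.get("src")
            if src:
                url = urljoin(self.base_url, src)
                self.images.append((url, attrs.get("alt") or ""))


def extract_image_urls(html, base_url):
    """Return the image URLs and alt texts found in an HTML string."""
    parser = ImageParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.images


# Made-up markup: a relative image, an image without src, an absolute image.
gallery = (
    '<img src="img/a.png" alt="A">'
    '<img alt="no src">'
    '<img src="https://cdn.example.com/b.jpg">'
)
images = extract_image_urls(gallery, "https://example.com/gallery/")
print(images)
```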
Octoparse can extract anything displayed on the web page and export it to structured formats such as Excel, CSV, HTML, and TXT, as well as to databases. However, Octoparse is not currently able to download images, videos, GIFs, or canvas content. We expect these functions to be added in a future version. Click HERE to download Octoparse and learn more from its tutorials.
Article in Spanish: Extraer Contenido de La Página Web
You can also read web scraping articles on the official website
Author: The Octoparse Team