1. Introduction


1.1. Why use Octoparse

                  . Point-and-click interface

                  . Deal with almost all the websites - dynamic or static

                  . Extract data from sites precisely

                  . Store or save your data

                  . Cloud service (Paid editions)

1.2. Basic concepts

                    1.2.1. What is web scraping?

                    1.2.2. AJAX (Asynchronous JavaScript and XML)

                    1.2.3. HTML (Hypertext Markup Language)

                    1.2.4. API

                    1.2.5. Web Cookie 


1.1. Why use Octoparse

Octoparse is a visual web data extraction tool. Both experienced and inexperienced users can use it to extract information from websites in bulk, and most scraping tasks need no coding. Octoparse extracts content from almost any website and lets you save it as clean, structured data in a format of your choice. You can also turn the extracted data into custom APIs. Instead of hiring interns to copy and paste by hand, you define the rule for collecting data and Octoparse does the rest.


. Point-and-click interface

Simply point at and click web elements. Octoparse identifies the data that follows the same pattern and extracts it automatically. No coding is required for most websites.


. Deal with almost all the websites - dynamic or static

                          . Extract text, image URLs, links, HTML, etc.

                          . Scrape category: a list/grid of links with similar structure

                          . Extract sites/contents loaded with Ajax, JavaScript, etc.

                          . Crawl websites with infinite scrolling

                          . Pagination

                          . Extract data behind a login

                          . Get data behind dropdown menus

                          . Capture data from search results pages


. Extract data from sites precisely

                          . XPath: automatic XPath generator

                          . XPath: manual XPath tool

                          . RegEx: Built-in regular expression tool
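To show what XPath and regular expressions each contribute, here is a minimal sketch in Python. It does not use Octoparse or reflect how Octoparse works internally. The sample markup, the `extract_items` function, and the field names are all made up for illustration. The standard library's `xml.etree.ElementTree` supports only a limited XPath subset, which is enough for this pattern.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not taken from any real site).
SAMPLE = """
<html>
  <body>
    <ul id="products">
      <li class="item"><a href="/p/1">Red mug</a><span class="price">Price: $4.50</span></li>
      <li class="item"><a href="/p/2">Blue mug</a><span class="price">Price: $5.25</span></li>
    </ul>
  </body>
</html>
"""


def extract_items(markup):
    """Locate repeated elements with XPath, then clean one field with a regex."""
    root = ET.fromstring(markup.strip())
    items = []
    # XPath step: select every <li class="item"> anywhere in the document.
    for li in root.findall(".//li[@class='item']"):
        link = li.find("a")
        price_text = li.find("span[@class='price']").text
        # Regex step: pull the number out of text such as "Price: $4.50".
        match = re.search(r"\$(\d+\.\d{2})", price_text)
        items.append({
            "name": link.text,
            "url": link.get("href"),
            "price": float(match.group(1)) if match else None,
        })
    return items


ITEMS = extract_items(SAMPLE)
for item in ITEMS:
    print(item)
```

XPath answers "where on the page is the data?", while the regular expression answers "which part of that text do I want?".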


. Store or save your data

                          . API

                          . Data file: CSV, Excel, HTML, Text

                          . Database: MySQL, SQL Server, Oracle


. Cloud service (Paid editions)

                          . Bulk extract data using cloud servers 24/7

                          . Extract and store your data in the cloud with high speed

                          . Automatic IP rotation: helps keep your IP from being blacklisted

                          . Schedule your data extraction


1.2. Basic concepts


1.2.1. What is web scraping?

Web scraping (also termed web data extraction, screen scraping, or web harvesting) is a software technique for extracting data from websites and turning the unstructured data on the web into structured formats that can be stored on your computer or in the cloud.

Data on the Internet is usually readable only in a web browser and has little or no structure. Most websites do not let users save a copy of the data they display, so the only option is to copy and paste it by hand. Capturing and separating exactly the data you want this way is time-consuming and tedious. Web scraping automates the process: instead of manually copying data from websites, a scraper collects and organizes it in minutes.

Web scraping is widely used on news portals, blogs, forums, e-commerce websites, social media, real estate sites, financial reports, and more. Its purposes are just as varied, including contact scraping, online price comparison, website change detection, web data integration, weather data monitoring, and research.

Web scraping is usually carried out by web-scraping software tools. These tools interact with websites in the same way you do when using a web browser like Chrome. Instead of only displaying the data in a browser, web scrapers extract data from web pages and store it in a local folder or database. Octoparse is a web scraper that lets you extract web data easily and for free, and it can collect large amounts of source data even from very complicated websites.
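The idea of turning a web page into structured data can be sketched in a few lines of Python. This is only an illustration of the general technique, not how Octoparse works internally. The sample page, the `CellCollector` class, and the column names are assumptions made up for this example, and the page is an inline string so that no network access is needed.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this from a site.
PAGE = """
<table>
  <tr><td class="city">Oslo</td><td class="temp">4</td></tr>
  <tr><td class="city">Cairo</td><td class="temp">29</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collects the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])      # start a new row
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Keep only text that sits inside a table cell.
        if self._in_cell and self.rows:
            self.rows[-1].append(data.strip())


def scrape_to_csv(page):
    """Extract the table rows from the page and return them as CSV text."""
    parser = CellCollector()
    parser.feed(page)
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["city", "temp_c"])   # header row (names chosen for this example)
    writer.writerows(parser.rows)
    return buffer.getvalue()


CSV_TEXT = scrape_to_csv(PAGE)
print(CSV_TEXT)
```

The markup goes in as loosely structured text, and rows with named columns come out, ready to be saved to a file or database.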


1.2.2. AJAX (Asynchronous JavaScript and XML)

AJAX, short for Asynchronous JavaScript and XML, is a set of web development techniques that allows a web page to update portions of its content without refreshing the whole page.

AJAX is used to create fast, dynamic web pages. It lets a page update asynchronously by exchanging small amounts of data with the server behind the scenes, so parts of the page can change without a full reload. Classic web pages, which do not use AJAX, must reload the entire page whenever the content changes. Websites such as Google Maps, Gumtree, Facebook, and Gmail use AJAX. Scraping sites that load content this way, for example through a “Load More” button or infinite scrolling, can be tricky. Octoparse handles these AJAX-driven sites for you, so you don’t need to know much about AJAX to extract the data.
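To see why AJAX pages need special handling, it helps to look at the kind of data exchange that happens behind a “Load More” button. The following Python sketch simulates it without any network access. The `FAKE_RESPONSES` payloads, the `fetch_page` stand-in, and the `items` / `next_page` field names are all hypothetical; real sites use their own formats, and this is not a description of Octoparse's internals.

```python
import json

# Hypothetical server responses keyed by page number. A real site would return
# something like this over HTTP each time "Load More" is clicked.
FAKE_RESPONSES = {
    1: '{"items": ["post A", "post B"], "next_page": 2}',
    2: '{"items": ["post C"], "next_page": null}',
}


def fetch_page(page):
    """Stand-in for the background request an AJAX page would make."""
    return FAKE_RESPONSES[page]


def load_all_items(first_page=1):
    """Keep requesting pages until the server says there is no next page."""
    items = []
    page = first_page
    while page is not None:
        payload = json.loads(fetch_page(page))
        items.extend(payload["items"])
        page = payload["next_page"]   # JSON null becomes None and ends the loop
    return items


ALL_ITEMS = load_all_items()
print(ALL_ITEMS)
```

The full list never appears in the page's initial HTML. It arrives in pieces, which is why a scraper has to trigger or follow each background request.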


1.2.3. HTML (Hypertext Markup Language)

HTML, short for Hypertext Markup Language, is the standard markup language used to create web pages. Almost every web page you see is built with HTML in one way or another.

An HTML page has two parts: the head and the body. The head holds information about the page, such as its title, which the browser shows at the top of the window or tab. The body holds the content displayed on the page.
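The head/body split is easy to see in a small document. The sketch below is a made-up example, not taken from any real site. It uses Python's standard `html.parser` module to read a minimal page and record which text belongs to the head and which to the body. The `SectionText` class is a name invented for this illustration.

```python
from html.parser import HTMLParser

# A minimal, hypothetical HTML document with a head and a body.
DOCUMENT = """<!DOCTYPE html>
<html>
  <head>
    <title>Sample page</title>
  </head>
  <body>
    <h1>Hello</h1>
    <p>This text is visible in the browser window.</p>
  </body>
</html>"""


class SectionText(HTMLParser):
    """Records text found inside <head> and <body> separately."""

    def __init__(self):
        super().__init__()
        self.section = None
        self.text = {"head": [], "body": []}

    def handle_starttag(self, tag, attrs):
        if tag in ("head", "body"):
            self.section = tag

    def handle_endtag(self, tag):
        if tag in ("head", "body"):
            self.section = None

    def handle_data(self, data):
        cleaned = data.strip()
        # Ignore whitespace-only text between tags.
        if self.section and cleaned:
            self.text[self.section].append(cleaned)


parser = SectionText()
parser.feed(DOCUMENT)
SECTIONS = parser.text
print(SECTIONS)
```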


Because websites are usually written in HTML, each web page is a structured document. What people see as data on a web page is embedded in that page's HTML, and it becomes far more useful once it is extracted. For non-developers, the easiest way to scrape HTML is to use an HTML scraping tool. Octoparse is designed to extract and manipulate HTML documents, so you can pull the data you want from websites written in HTML.


1.2.4. API

API, short for Application Programming Interface, is a set of routine definitions, protocols, and tools for building software and applications. An API may be for a web-based system, operating system, database system, computer hardware, or software library. An API specification can take many forms, but it often includes specifications for routines, data structures, object classes, variables, or remote calls. POSIX, the Microsoft Windows API, the C++ Standard Template Library, and the Java APIs are examples of different forms of APIs.


Put simply, an API is the messenger that takes your request, tells the system what you want it to do, and returns the response to you.

Think of an API as a waiter in a restaurant. Imagine you’re sitting at a table with a menu of choices to order from. The kitchen is the part of the system that will prepare your order. What is missing is the critical link that communicates your order to the kitchen and delivers your food back to the table. That is where the waiter, or API, comes in.
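The waiter analogy can be written out as a toy Python program. Everything here is invented for illustration: the `Kitchen` and `WaiterAPI` classes, the menu, and the response format do not correspond to any real API. The point is only that the caller talks to the interface and never to the system behind it.

```python
class Kitchen:
    """The internal system. Customers never call it directly."""

    _MENU = {"soup": 3, "pasta": 8}   # dish -> minutes to prepare (made-up values)

    def prepare(self, dish):
        if dish not in self._MENU:
            raise KeyError(dish)
        return f"{dish} (ready in {self._MENU[dish]} min)"


class WaiterAPI:
    """The public interface: takes a request and returns a response."""

    def __init__(self, kitchen):
        self._kitchen = kitchen

    def order(self, dish):
        try:
            return {"status": "ok", "result": self._kitchen.prepare(dish)}
        except KeyError:
            return {"status": "error", "result": f"{dish} is not on the menu"}


waiter = WaiterAPI(Kitchen())
GOOD = waiter.order("soup")
BAD = waiter.order("sushi")
print(GOOD)
print(BAD)
```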

There are many different types of APIs for operating systems, applications, and websites. Windows, for example, has many API sets that are used by system hardware and applications. When you copy and paste text from one application to another, it is an API that makes that work. With Octoparse, you can also create your own APIs.


1.2.5. Web Cookie 

A web cookie (also called an HTTP cookie or browser cookie) is a small piece of text that a website stores in the user’s web browser. HTTP is stateless, so on its own a web server cannot tell whether two requests come from the same browser. To solve this, the browser attaches the cookie data to each HTTP request it sends to that website. Cookies generally contain information such as the user’s ID, the user's browsing activity on the site, or other details such as names, account information, addresses, and phone or card numbers. Authentication cookies, which tell the server that the user is logged in, are also very common. The security of an authentication cookie generally depends on the security of the issuing website and the user's web browser, and on whether the cookie data is encrypted. Security vulnerabilities may allow a cookie's data to be read by an attacker, used to gain access to user data, or used to gain access to the website to which the cookie belongs.

Octoparse enables you to save the cookies of the current web page, so you don’t need to log in again when you return to the website.
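The cookie exchange described in this section can be sketched with Python's standard `http.cookies` module. The header value and the `session_id` name below are hypothetical, and the sketch shows the general HTTP mechanism rather than how Octoparse stores cookies.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header value a server might send after a login.
set_cookie_header = "session_id=abc123; Path=/; HttpOnly"

# The browser parses the header and stores the cookie.
jar = SimpleCookie()
jar.load(set_cookie_header)

SESSION_VALUE = jar["session_id"].value

# On the next request to the same site, the browser sends the stored
# name=value pairs back in a Cookie header, which is how the server
# recognizes the returning user.
COOKIE_HEADER = "; ".join(
    f"{name}={morsel.value}" for name, morsel in jar.items()
)

print("Stored value:", SESSION_VALUE)
print("Cookie header sent back:", COOKIE_HEADER)
```

Saving and reusing a cookie like this one is what lets a later visit skip the login step.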








