Is It Possible to Scrape an HTML Page with JavaScript?

Friday, December 24, 2021



Yes. Once the JavaScript has run, whether you had to wait for it to finish or take some action to trigger it, you scrape the resulting HTML. If there is content you can see in your browser, there is HTML behind it. You don't need special tools to scrape JavaScript pages (other than whatever is needed to execute the JavaScript or trigger it), just as you don't need special tools to scrape .aspx or PHP pages.
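To make this concrete, here is a minimal Python sketch that uses only the standard library. It assumes the rendered HTML has already been obtained some other way (for example, from a tool that runs a real browser). The `rendered_html` string, the `phone` class name and the `extract_by_class` helper are hypothetical and only stand in for whatever the real page contains.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collects the text inside every element that carries a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # > 0 while we are inside a matching element
        self.results = []   # one string per matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1             # nested tag inside a match
        elif self.target_class in classes:
            self.depth = 1              # start of a new match
            self.results.append("")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.results[-1] += data


def extract_by_class(html, target_class):
    """Return the stripped text of each element with the given class."""
    parser = ClassTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]


# Hypothetical HTML as it might look AFTER the page's JavaScript has run.
rendered_html = (
    '<div class="ad"><h1 class="title">Red bike</h1>'
    '<span class="phone">0123 456 789</span></div>'
)
print(extract_by_class(rendered_html, "phone"))  # ['0123 456 789']
```

Nothing in this parsing step is specific to JavaScript: by the time the HTML reaches the parser, it is ordinary markup.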

JavaScript is a high-level, dynamic, untyped, and interpreted programming language. It has been standardized in the ECMAScript language specification. Alongside HTML and CSS, it is one of the three core technologies of World Wide Web content production: the majority of websites employ it, and it is supported by all modern web browsers without plug-ins. JavaScript is prototype-based with first-class functions. (Source: Wikipedia.) It is most commonly used as part of web browsers, whose implementations allow client-side scripts to interact with the user, control the browser, communicate asynchronously, and alter the document content that is displayed. It is also used in server-side programming, game development, and the creation of desktop and mobile applications. The most common use of JavaScript is to add client-side behavior to HTML pages, also known as Dynamic HTML (DHTML), for example, loading new page content or submitting data to the server via Ajax without reloading the page.

Ajax, short for Asynchronous JavaScript and XML, is a set of web development techniques that allows a web page to update portions of its content without having to refresh the whole page.

In fact, you don't need to know much about Ajax to extract data. All you need to figure out is whether the site you want to scrape uses Ajax. Many websites, such as Google, Amazon and eBay, rely heavily on Ajax. A common sign is that the URL of the page stays the same while part of the content updates. With Octoparse, you can easily extract data from web pages where data is loaded with Ajax.
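Another rough check, shown here as a Python sketch, is to compare the text you see in the browser with the HTML the server first sends: if the text is visible on screen but missing from that initial HTML, it was probably added later by JavaScript or Ajax. This only illustrates the idea and is not how Octoparse detects Ajax. The `likely_loaded_by_ajax` helper and the sample markup are hypothetical, and `raw_html` is assumed to be the unrendered page source.

```python
from html.parser import HTMLParser


class VisibleTextCollector(HTMLParser):
    """Gathers text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip > 0:
            self._skip -= 1

    def handle_data(self, data):
        if self._skip == 0:
            self.chunks.append(data)


def likely_loaded_by_ajax(raw_html, text_seen_in_browser):
    """Return True if the text is absent from the initial HTML's visible text."""
    collector = VisibleTextCollector()
    collector.feed(raw_html)
    collector.close()
    # Normalize whitespace on both sides before comparing.
    static_text = " ".join(" ".join(collector.chunks).split())
    wanted = " ".join(text_seen_in_browser.split())
    return wanted not in static_text


# Hypothetical page source: the number is masked until a button is clicked.
raw_html = (
    '<html><body><span class="phone">0123 4...</span>'
    '<button>Reveal</button></body></html>'
)
print(likely_loaded_by_ajax(raw_html, "0123 456 789"))
```

A `True` result is only a hint. The text could also be produced by inline JavaScript without any network request, so it is worth confirming in the browser's developer tools.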


Ajax Case: Gumtree.com

This page has contact details that require us to click the "Reveal" button to see the complete number. When we click "Reveal", the rest of the contact number appears, but the URL does not change.

So we know this page uses Ajax, and we need to set "Load page with Ajax" in Octoparse. Otherwise, the full number cannot be extracted.


First, open the page in the built-in browser.


Then click on "Start". Select “Click an item”.


This page uses Ajax, so we need to set "Load page with Ajax".

Choose “Load page with Ajax”. Set an Ajax timeout. Click “Save”.


Then extract the information you want.


Extract brand: Click on the title. Select “extract data”.


Extract price: Extract the price in the same way, then extract the contact details you just revealed with the click.


Then run the local extraction, and the data you are looking for will be extracted.
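The Ajax timeout set in the steps above gives the page a limited amount of time to finish loading the new content after the click. The general idea behind such a timeout can be sketched in Python as a polling loop. This is a generic illustration, not Octoparse's implementation: `wait_for`, its parameters and the `fake_page_state` dictionary are all hypothetical.

```python
import time


def wait_for(condition, timeout=5.0, interval=0.1):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or None if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


# Hypothetical stand-in for a page whose number appears some time after a click.
fake_page_state = {"phone": None}


def read_phone():
    return fake_page_state["phone"]


# Nothing has loaded yet, so this gives up after about 0.2 seconds.
print(wait_for(read_phone, timeout=0.2, interval=0.05))  # None

# Once the "Ajax response" has arrived, the value is returned at once.
fake_page_state["phone"] = "0123 456 789"
print(wait_for(read_phone, timeout=0.2, interval=0.05))  # 0123 456 789
```

If the timeout is too short, the scraper moves on before the content arrives and the field comes back empty. If it is too long, every page takes longer than it needs to.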


Now you know how to extract data from web pages loaded with Ajax. 

Happy data hunting.


Author: The Octoparse Team




