Is It Possible to Scrape an HTML Page with JavaScript?

4 min read

What is AJAX?

Once the JavaScript has executed, whether you had to wait for it to finish or take some action to trigger it, you scrape the resulting HTML. If you can see content in your browser, there is HTML behind it. You don’t need special tools to scrape JavaScript pages (other than the tools needed to execute the JavaScript or trigger its execution), just as you don’t need special tools to scrape .aspx or PHP pages.
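As a minimal sketch of that last step, the Python snippet below assumes you already hold the rendered HTML as a string (for example, exported from a browser after the scripts have run) and pulls values out of it using only the standard library. The `price` class name, the `PriceExtractor` class and the `extract_prices` helper are hypothetical names chosen for illustration; they are not part of Octoparse.

```python
from html.parser import HTMLParser


class PriceExtractor(HTMLParser):
    """Collects the text of every <span> whose class list includes 'price'.

    The tag and class name are assumptions made for this example.
    """

    def __init__(self):
        super().__init__()
        self._in_price = False  # True while inside a matching <span>
        self.prices = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "price" in classes:
            self._in_price = True
            self.prices.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            self.prices[-1] += data


def extract_prices(rendered_html):
    """Return the stripped text of each price element in the rendered HTML."""
    parser = PriceExtractor()
    parser.feed(rendered_html)
    parser.close()
    return [price.strip() for price in parser.prices]
```

The parser neither knows nor cares whether the markup was written by hand, generated on the server, or injected by JavaScript; it only sees the HTML that exists once the page has finished rendering.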

JavaScript is a high-level, dynamic, untyped, and interpreted programming language. It has been standardized in the ECMAScript language specification. Alongside HTML and CSS, it is one of the three core technologies of World Wide Web content production: the majority of websites employ it and it is supported by all modern web browsers without plug-ins. JavaScript is prototype-based with first-class functions. (Source: Wikipedia) It is most commonly used as part of web browsers, whose implementations allow client-side scripts to interact with the user, control the browser, communicate asynchronously, and alter the document content that is displayed. It is also used in server-side programming, game development and the creation of desktop and mobile applications. The most common use of JavaScript is to add client-side behavior to HTML pages, a.k.a. Dynamic HTML (DHTML), for example, loading new page content or submitting data to the server via AJAX without reloading the page.

Ajax, short for Asynchronous JavaScript and XML, is a set of web development techniques that allows a web page to update portions of its content without having to refresh the page.

In fact, you don’t need to know much about Ajax to extract data. All you need to figure out is whether the site you want to scrape uses Ajax. Many websites, such as Google, Amazon and eBay, rely heavily on Ajax. Usually the URL of the page does not change when part of the content is updated. With Octoparse, you can easily extract data from web pages where data is loaded with Ajax.

But how do I know if a web page loads content using AJAX?

When you use a click action to load web data, it is fairly straightforward to tell whether AJAX is being used. When AJAX is used, the web page loads the additional content without reloading the page. Hence, the reloading icon is a good indicator of whether AJAX has been used.

  • When there’s AJAX involved, the page does not reload when additional content is loaded, so you should see NO reloading sign.
  • If there’s no AJAX involved, you should see the page reload, with the reloading icon running, when you click to load more information.
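If you would rather check this in code than watch the reloading icon, a rough heuristic is to compare the page before and after the click: the content changed, but the URL did not. The sketch below is written in Python under that assumption; `likely_ajax_update` is a hypothetical helper, not an Octoparse feature, and ignoring the `#fragment` part of the URL is a choice made for this example. Treat the result as a hint rather than proof, since some sites change the URL without reloading the page.

```python
from urllib.parse import urldefrag


def likely_ajax_update(url_before, url_after, html_before, html_after):
    """Return True when the content changed but the page URL did not.

    The #fragment is ignored (an assumption of this sketch), because a
    fragment change on its own does not reload the page.
    """
    same_url = urldefrag(url_before).url == urldefrag(url_after).url
    content_changed = html_before != html_after
    return same_url and content_changed
```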

How to deal with AJAX

Octoparse uses the page reload as a signal when executing a click action. If the page reloads after an element is clicked, Octoparse executes the next action once the reload finishes. But because pages with AJAX do not reload, Octoparse never receives the signal to act and gets stuck. So we need to set up an AJAX timeout for the “Click Item” or “Click to Paginate” action to tell Octoparse to move on to the next action when the timeout is reached. There are two ways to handle AJAX in Octoparse.

  • AJAX auto-detection

Octoparse sets up an AJAX timeout automatically when it detects AJAX on the page.

For example, Walmart’s website uses AJAX to load the next page. So when we choose to click the next page button, Octoparse automatically sets up an AJAX timeout for the action.

If you need a longer or shorter timeout, simply click on the dropdown menu and choose the one you’d like.

  • Set up AJAX manually

When a task is built manually, or if Octoparse fails to detect AJAX, you can set the timeout up yourself by clicking the “Click Item” or “Click to Paginate” box. The AJAX settings are under Options: tick “Load with AJAX” and select the timeout you want.

Even for pages that do not use AJAX, an AJAX timeout can still be used to cut down long wait times. For example, if a page keeps loading long after the information you need has appeared, you can use an AJAX timeout to “force” Octoparse to move on to the next step instead of waiting for the page to finish loading.
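The idea behind the timeout can be sketched in a few lines of Python: wait for a signal (such as a page reload), but stop waiting once a deadline passes. This is only an illustration of the general technique, not Octoparse’s actual implementation; `wait_for_signal`, `signal_received` and `poll_interval` are hypothetical names.

```python
import time


def wait_for_signal(signal_received, timeout_seconds, poll_interval=0.05):
    """Poll until signal_received() returns True or the timeout elapses.

    Returns "signal" if the signal arrived in time, otherwise "timeout".
    Either way the caller can move on to the next action.
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if signal_received():
            return "signal"
        time.sleep(poll_interval)
    return "timeout"
```

On an AJAX page the reload signal never arrives, so a call such as `wait_for_signal(lambda: False, 0.1)` returns "timeout" once the deadline passes, and the workflow continues instead of getting stuck.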

Happy data hunting.
