Deal with AJAX

Sunday, April 08, 2018 7:58 AM

In this tutorial, you will learn how to deal with AJAX when scraping data with Octoparse.

AJAX stands for Asynchronous JavaScript and XML. It is a set of web development techniques that allows a web page to update parts of its content without refreshing the whole page. When you click something on a page that uses AJAX, the content updates and no reloading sign is displayed.

1) Why do I need to deal with AJAX when using Octoparse?

When scraping data from the web, Octoparse takes a page reload as the signal to execute actions such as "Click item" and "Click to paginate". A web page that uses AJAX updates its content without reloading. Because there is no reload, Octoparse never receives the signal to act and gets stuck on the last step. As a result, we may extract no data at all, or far less than we expect.

So when you want to scrape data from a web page that uses AJAX, you need to set up an AJAX timeout to keep Octoparse from getting stuck. For example, if you set a 2-second AJAX timeout for the "Click to paginate" action, Octoparse will wait for 2 seconds and then execute the action. In this case, Octoparse does not need to wait for a reload signal before acting.
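
The difference between waiting for a reload and waiting for a fixed AJAX timeout can be sketched in a few lines of Python. This is only an illustrative model, not Octoparse's actual implementation: the function `wait_after_click`, the `give_up_after` parameter (a stand-in for being stuck indefinitely), and the returned strings are all hypothetical.

```python
import threading
import time
from typing import Optional


def wait_after_click(reload_signal: threading.Event,
                     ajax_timeout: Optional[float] = None,
                     give_up_after: float = 0.5) -> str:
    """Model how a scraper decides to move on after clicking an element.

    Hypothetical sketch: reload_signal is set when the page reloads;
    give_up_after stands in for waiting forever.
    """
    if ajax_timeout is not None:
        # AJAX timeout set: wait a fixed time, then carry on
        # without needing a reload signal.
        time.sleep(ajax_timeout)
        return "proceeded after AJAX timeout"
    # No AJAX timeout: the only way forward is a page reload.
    if reload_signal.wait(timeout=give_up_after):
        return "proceeded after reload"
    return "stuck waiting for reload"


if __name__ == "__main__":
    no_reload = threading.Event()  # an AJAX page never sets this
    print(wait_after_click(no_reload))                    # no timeout configured
    print(wait_after_click(no_reload, ajax_timeout=0.2))  # short timeout for the demo
```

In this model, the first call never sees a reload and ends up stuck, while the second call simply pauses and moves on, which is the behavior the AJAX timeout is meant to provide.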

 

2) When and how do I set up an AJAX timeout in Octoparse?

Because websites usually apply AJAX to elements that need to be clicked, such as "Load more" and "View reviews", an AJAX timeout is often needed for steps like "Click to paginate" or "Click item".

First, we need to identify whether the page uses AJAX. If no reloading sign appears after you click an element to update the page, then you can be sure that the element is using AJAX.
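
The check described above can be written down as a small rule. The sketch below is a hypothetical illustration in Python, not part of Octoparse: the `ClickObservation` record and the `uses_ajax` function are made-up names that simply encode "content changed, but no reloading sign was shown".

```python
from dataclasses import dataclass


@dataclass
class ClickObservation:
    """What you saw after clicking an element (hypothetical record)."""
    content_changed: bool    # did new content appear on the page?
    reload_sign_shown: bool  # did the browser show a reloading sign?


def uses_ajax(obs: ClickObservation) -> bool:
    # Content updated without a reload sign: treat the element as AJAX.
    return obs.content_changed and not obs.reload_sign_shown


if __name__ == "__main__":
    load_more = ClickObservation(content_changed=True, reload_sign_shown=False)
    next_page = ClickObservation(content_changed=True, reload_sign_shown=True)
    print(uses_ajax(load_more))  # element that updates in place
    print(uses_ajax(next_page))  # element that reloads the page
```

Only the first kind of element would call for an AJAX timeout; the second reloads the page and should be left without one.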

 

To set up an AJAX timeout, go to "AJAX Load" and select "Load the page with AJAX" in Customize Action.

 

After ticking the "Load the page with AJAX" box, you can set the AJAX timeout. We usually recommend 2-4 seconds.

 

 

3) Do not set up AJAX timeout when there’s no AJAX.

When you scrape a web page that reloads after an update, please don’t set up an AJAX timeout. Otherwise, Octoparse will stop loading the web page once the AJAX timeout you set has elapsed, which may leave the page only partially loaded. If the web page doesn’t load completely, Octoparse may have problems scraping data or executing the next step in the workflow.
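
To see why a timeout can hurt on a page that really does reload, here is a minimal Python sketch. It is a simplified model under stated assumptions rather than a description of Octoparse internals: `loaded_parts` is a hypothetical helper, and the page parts and their load times are made-up numbers.

```python
from typing import Dict, List, Optional


def loaded_parts(finish_times: Dict[str, float],
                 ajax_timeout: Optional[float] = None) -> List[str]:
    """Return the page parts that finished loading before the scraper moved on.

    finish_times maps a part of the page to the second at which it finishes
    loading (made-up numbers). With no AJAX timeout, the scraper waits for the
    full reload, so every part is available.
    """
    if ajax_timeout is None:
        return list(finish_times)
    # Loading is cut off at the timeout; slower parts never arrive.
    return [part for part, t in finish_times.items() if t <= ajax_timeout]


if __name__ == "__main__":
    page = {"header": 0.5, "product list": 3.0, "pagination links": 5.5}
    print(loaded_parts(page))                  # full reload
    print(loaded_parts(page, ajax_timeout=3))  # cut off at 3 seconds
```

In this toy example the pagination links finish loading after the timeout, so a workflow that needs to click them would have nothing to click.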

 

 

Related articles:

Lesson 6: Pagination - Capture data from multiple pages 

Dealing with Infinitive Scrolling/Load More 

Extract multiple pages through pagination 

 

 

 
