Scrape Websites with Infinite Scrolling

Saturday, October 08, 2016 5:44 AM

Infinite scrolling is usually implemented with JavaScript or AJAX: the page loads more content automatically when you scroll down or click a "load more" button. Getting data from sites with this feature can be somewhat challenging.

Many people therefore wonder whether they can extract information that does not appear on the initial page until they scroll down.

This guide will help you out. It’s really easy, and you don’t need to be a professional coder.

 

In this tutorial, I will use Jabong as an example to show you how to extract data from infinite-scrolling websites that have no pagination in the URL (like ?page=2, ?page=3). If a website does have such pagination, you can change the page number to build a list of URLs, paste the URLs in, and extract data from that URL list. If not, follow the steps below.
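For sites that do expose a page number in the URL, the URL list can be generated rather than typed by hand. The snippet below is a minimal sketch, not part of Octoparse: the function name `build_page_urls`, the `page` parameter name, and the example.com address are assumptions made for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_page_urls(base_url, first_page, last_page, param="page"):
    """Return one URL per page number by setting the `param` query field."""
    parts = urlsplit(base_url)
    # Keep the query fields the URL already has, except an existing page field.
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k != param]
    urls = []
    for page in range(first_page, last_page + 1):
        query = urlencode(kept + [(param, str(page))])
        urls.append(
            urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))
        )
    return urls


# Hypothetical site used only for illustration.
urls = build_page_urls("http://example.com/clothing/", 1, 3)
for url in urls:
    print(url)
```

The printed list can then be pasted into whatever tool accepts a list of URLs.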

(Download my extraction task for this tutorial HERE just in case you need it.)

 

Step 1. Navigate to the target URL: enter the URL in the built-in browser.

(URL of the example: http://www.jabong.com/clothing/)

 

Step 2. We are now on the search result page. Wait until the page has loaded, then select “Advanced Options”. ➜ Choose “Scroll down to page bottom when finished loading”. ➜ Then enter how many times you want to scroll. Here I choose “Scroll down for one screen”. You can also choose “Scroll down to bottom of the page”.
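If you are curious what "scroll a fixed number of times" amounts to, here is a rough sketch of the general idea in Python. It is not how Octoparse is implemented; `scroll_until_stable`, `FakePage`, and the numbers are hypothetical. The loop scrolls one screen at a time and stops either when no new items appear or when the scroll limit is reached.

```python
def scroll_until_stable(scroll_once, count_items, max_scrolls=20):
    """Scroll until a scroll loads nothing new or max_scrolls is reached.

    scroll_once: callable that scrolls the page down by one screen.
    count_items: callable returning how many items are currently loaded.
    Returns the number of scrolls performed.
    """
    seen = count_items()
    for done in range(max_scrolls):
        scroll_once()
        now = count_items()
        if now == seen:  # nothing new was loaded, so assume the end was reached
            return done + 1
        seen = now
    return max_scrolls


class FakePage:
    """Stand-in for a real browser page: each scroll loads `batch` more items."""

    def __init__(self, total, batch):
        self.total = total
        self.batch = batch
        self.loaded = min(batch, total)

    def scroll(self):
        self.loaded = min(self.total, self.loaded + self.batch)

    def count(self):
        return self.loaded


page = FakePage(total=50, batch=20)
scrolls = scroll_until_stable(page.scroll, page.count)
print(scrolls, page.count())
```

On a real page the content loads asynchronously, so a short wait between scrolling and counting would also be needed; the sketch leaves that out.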

 

Step 3. Now we have finished configuring the scrolling, and you can extract the list information you want. Move your cursor over the sections with a similar layout that you want to extract data from.

Click the first highlighted link ➜ Click “Create a list of items” (sections with similar layout). ➜ Click “Add current item to the list”. The first highlighted link has now been added to the list. ➜ Click “Continue to edit the list”.

Click the second highlighted link ➜ Click “Add current item to the list” again. Now we have all the links with similar layout. ➜ Then click “Finish Creating List” ➜ Click “loop” to process the list and extract the elements in each page. Octoparse will now repeat the selection automatically.

 

Step 4. This website is a little complicated: you will find that the first section loops twice.

 

This is because the other part (in the pink box) shares the same XPath as the section we want to extract.

 

In this case, we need to change the XPath so that it locates exactly the section we want. Copy the XPath, inspect it in FirePath to find the exact XPath, and then paste the exact XPath into the “Variable list” box.

(Note: Click HERE to learn more about XPath.)
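To see why a shared XPath causes the extra match, here is a small sketch using Python's standard `xml.etree.ElementTree`, which supports a limited subset of XPath. The markup and class names are made up for illustration and are not Jabong's actual page structure. The broad path also matches the unrelated block, while the path with an attribute condition matches only the product sections.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: a promo block reuses the same tag structure as the products.
PAGE = """
<div id="page">
  <div class="promo"><a href="/sale">Big sale</a></div>
  <div class="product"><a href="/p/1">Blue shirt</a></div>
  <div class="product"><a href="/p/2">Red dress</a></div>
</div>
""".strip()

root = ET.fromstring(PAGE)

# Broad path: every link one level down, including the promo link.
broad = [a.text for a in root.findall("./div/a")]

# Exact path: only links inside sections whose class is "product".
exact = [a.text for a in root.findall("./div[@class='product']/a")]

print(broad)
print(exact)
```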

 

Step 5. Now we can extract the results. Notice that the second item has a discount while the first one doesn’t. If you want to extract the discount information, extract the data from the second item.

Click the second item under “Loop Item”. ➜ Click “Loop Action” ➜ Click “Extract Data”. And then you can extract the data you want.

Extract the title of the first section. ➜ Click the title ➜ Select “Extract text”. Other contents can be extracted in the same way.

 

To locate the current price exactly, you need to change the XPath. Choose the second row about price. ➜ Choose "Customize Field". ➜ Choose "Define ways to locate an item". ➜ Then paste the XPath in the "Relative XPath" box. ➜ Click “OK” ➜ Click “Save”.

 

If you don’t need to extract the discount, you can extract the data directly from the first item, but you still need to change the XPath to locate the current price exactly.
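The idea of a relative XPath can be sketched as follows: each path is evaluated from the current loop item rather than from the top of the page. The markup, class names, and prices below are invented for illustration, and the `extract` helper is hypothetical. Note how the second item has two price rows and an exact class match picks only the current price.

```python
import xml.etree.ElementTree as ET

# Hypothetical list markup: the second item carries an old price and a discount.
ITEMS = """
<ul>
  <li class="item">
    <span class="title">Blue shirt</span>
    <span class="price">799</span>
  </li>
  <li class="item">
    <span class="title">Red dress</span>
    <span class="price old">1999</span>
    <span class="price">1499</span>
    <span class="discount">25% off</span>
  </li>
</ul>
""".strip()


def extract(item):
    """Read the fields of one loop item using paths relative to that item."""
    discount = item.find("./span[@class='discount']")
    return {
        "title": item.findtext("./span[@class='title']"),
        # Exact match on class="price" skips the class="price old" row.
        "price": item.findtext("./span[@class='price']"),
        "discount": discount.text if discount is not None else None,
    }


rows = [extract(li) for li in ET.fromstring(ITEMS).findall("./li[@class='item']")]
for row in rows:
    print(row)
```

Items without a discount simply get an empty value for that field, which is why building the fields from an item that has a discount covers both cases.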

 

Step 6. All the selected content will appear in Data Fields. ➜ Click a “Field Name” to modify it.

 

Step 7. Click “Next” ➜ Click “Next” ➜ Click “Local Extraction” ➜ “OK” to run the task on your computer. Octoparse will automatically extract all the data selected.

 

Step 8. The extracted data will be shown in the "Data Extracted" pane. Click the "Export" button to export the results to an Excel file, a database, or another format, and save the file to your computer.
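If you later want to process the data outside Octoparse, extracted rows like these map naturally onto a CSV file, which Excel can open. This is a minimal sketch using Python's standard `csv` module; the rows, field names, and the `to_csv` helper are made up for illustration and do not reflect Octoparse's own export format.

```python
import csv
import io

# Hypothetical extracted rows; a missing discount is stored as None.
rows = [
    {"title": "Blue shirt", "price": "799", "discount": None},
    {"title": "Red dress", "price": "1499", "discount": "25% off"},
]


def to_csv(rows, fieldnames):
    """Serialize rows to CSV text with a header line; None becomes an empty cell."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


csv_text = to_csv(rows, ["title", "price", "discount"])
print(csv_text)
```

To save a file instead of printing, the same writer can be pointed at a file opened with `open("results.csv", "w", newline="")`.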

 

Author: The Octoparse Team

 

 

 
