Scrape Websites with Infinite Scrolling
Saturday, October 08, 2016 5:44 AM
You may wonder whether you can extract information that is not listed on the initial page and appears only after you scroll down.
This guide will help you out. It’s really easy, and you do not need to be a professional coder.
In this tutorial, I will use Jabong as an example to show you how to extract data from infinite-scrolling websites that have no pagination in the URL (like ?page=2, ?page=3). If a website does have such a parameter, you can change the page number to build a list of URLs and then extract data from that URL list. If not, follow the steps below.
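For the case where the URL does carry a page number, building the URL list can be sketched in a few lines of Python. This is only an illustration: the function name `build_page_urls`, the `page` parameter name, and the example.com address are assumptions made for the example, not something taken from Jabong or Octoparse.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_page_urls(base_url, first_page, last_page, param="page"):
    """Return one URL per page number by setting the `param` query parameter."""
    scheme, netloc, path, query, fragment = urlsplit(base_url)
    # Keep every existing query parameter except an old page number.
    existing = [(k, v) for k, v in parse_qsl(query) if k != param]
    urls = []
    for page in range(first_page, last_page + 1):
        new_query = urlencode(existing + [(param, page)])
        urls.append(urlunsplit((scheme, netloc, path, new_query, fragment)))
    return urls


# Hypothetical paginated site, used only for illustration.
urls = build_page_urls("http://example.com/clothing/?sort=new", 2, 4)
```

The resulting list can be pasted wherever a URL list is expected. Infinite-scrolling pages offer no such parameter, which is why the steps below are needed.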
(Download my extraction task of this tutorial HERE just in case you need it.)
Step 1. Navigate to the target URL by entering it in the built-in browser.
(URL of the example: http://www.jabong.com/clothing/)
Step 2. We are now on the search results page. Wait until the page has loaded, then select “Advanced Options”. ➜ Choose “Scroll down to page bottom when finished loading”. ➜ Then enter how many times you want to scroll. Here I choose “Scroll down for one screen”. You can also choose “Scroll down to bottom of the page”.
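If you are curious what these two scroll options amount to, here is a rough Python sketch of the same idea. It is not how Octoparse works internally. The function `scroll_to_load` and its parameters are names I made up; it only assumes a driver object with an `execute_script` method, which is what a Selenium WebDriver provides. The small `RecordingDriver` class is a stand-in so the loop can be tried without a browser.

```python
import time


def scroll_to_load(driver, max_scrolls=10, pause_seconds=2.0, one_screen=True):
    """Scroll repeatedly so lazily loaded items appear; return the scroll count.

    `driver` is any object with an `execute_script(js)` method, for example
    a Selenium WebDriver (an assumption of this sketch).
    """
    step_js = (
        "window.scrollBy(0, window.innerHeight);"  # one screen at a time
        if one_screen
        else "window.scrollTo(0, document.body.scrollHeight);"  # jump to bottom
    )
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    for _ in range(max_scrolls):
        driver.execute_script(step_js)
        scrolls += 1
        time.sleep(pause_seconds)  # give the page time to load new items
        new_height = driver.execute_script("return document.body.scrollHeight")
        # In bottom mode, an unchanged height means nothing more was loaded.
        if not one_screen and new_height == last_height:
            break
        last_height = new_height
    return scrolls


class RecordingDriver:
    """Stand-in driver for trying the loop without a browser (hypothetical)."""

    def __init__(self, heights):
        self.heights = list(heights)

    def execute_script(self, js):
        if js.startswith("return"):
            # Report the next page height; repeat the last one when exhausted.
            return self.heights.pop(0) if len(self.heights) > 1 else self.heights[0]
        return None
```

In one-screen mode the loop simply runs the requested number of times, much like entering a scroll count in the tool. In bottom mode it stops early once the page height stops growing.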
Step 3. Now that scrolling is configured, you can extract the list information you want. Move your cursor over the sections with a similar layout that you want to extract data from.
Click the first highlighted link ➜ Click “Create a list of items” (sections with similar layout). ➜ Click “Add current item to the list”. The first highlighted link has now been added to the list. ➜ Click “Continue to edit the list”.
Click the second highlighted link ➜ Click “Add current item to the list” again. Now we have all the links with a similar layout. ➜ Then click “Finish Creating List” ➜ Click “loop” to process the list and extract the elements in each page. Octoparse will now repeat the selection automatically.
Step 4. This website is a little complicated: you will find that the first section is looped over twice.
This is because another part of the page (in the pink box) shares the same XPath as the section we want to extract.
In this case we need to change the XPath so that it locates exactly the section we want. Copy the XPath, inspect it in FirePath, and work out the exact XPath. Then paste the exact XPath into the “Variable list” box.
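To see why a shared XPath makes an item show up twice, here is a small Python sketch. The markup is a simplified stand-in I invented (the ids `promo` and `catalog` and the class `product` are assumptions, not Jabong's real page structure), and it uses the limited XPath subset supported by Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: the same "product" class appears both in
# a promotional block and in the main catalog.
PAGE = """
<body>
  <div id="promo">
    <div class="product"><a href="/p/1">Shirt A</a></div>
  </div>
  <div id="catalog">
    <div class="product"><a href="/p/1">Shirt A</a></div>
    <div class="product"><a href="/p/2">Shirt B</a></div>
  </div>
</body>
"""

root = ET.fromstring(PAGE.strip())

# Broad XPath: matches every product block, including the promotional one.
broad = root.findall(".//div[@class='product']")

# Exact XPath: only product blocks that sit directly inside the catalog.
exact = root.findall(".//div[@id='catalog']/div[@class='product']")

broad_titles = [div.find("a").text for div in broad]
exact_titles = [div.find("a").text for div in exact]
```

The broad expression picks up "Shirt A" twice, which is the same symptom as the first section looping twice. Anchoring the expression to the container you actually want removes the duplicate.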
(Note: Click HERE to know more about XPath.)
Step 5. Now we can extract the results. Notice that the second item has a discount while the first one doesn’t. If you want to extract the discount information, extract the data from the second item.
Click the second item under “Loop Item”. ➜ Click “Loop Action” ➜ Click “Extract Data”. And then you can extract the data you want.
Extract the title of the first section. ➜ Click the title ➜ Select “Extract text”. Other contents can be extracted in the same way.
To locate the current price exactly, you need to change the XPath. Choose the second row about price. ➜ Choose "Customize Field". ➜ Choose "Define ways to locate an item". ➜ Then paste the XPath in the "Relative XPath" box. ➜ Click “OK” ➜ Click “Save”.
If you don’t need to extract the discount, you can extract the data directly from the first item, but you still need to change the XPath to locate the current price exactly.
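The idea of a relative XPath, where each field is located starting from the current loop item, can be sketched in Python as follows. The markup, class names, and prices are invented for illustration, and the helper names `text_or_none` and `extract_items` are my own. The sketch also shows why an item without a discount simply yields an empty value for that field.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the second item has an old price and a discount,
# the first one does not.
ITEMS = """
<div id="catalog">
  <div class="product">
    <span class="title">Shirt A</span>
    <span class="price">799</span>
  </div>
  <div class="product">
    <span class="title">Shirt B</span>
    <span class="old-price">999</span>
    <span class="price">699</span>
    <span class="discount">30% off</span>
  </div>
</div>
"""


def text_or_none(item, relative_xpath):
    """Return the text of the first match under `item`, or None if absent."""
    node = item.find(relative_xpath)
    return node.text if node is not None else None


def extract_items(markup):
    """Loop over product blocks and read each field relative to the block."""
    root = ET.fromstring(markup.strip())
    rows = []
    for item in root.findall("./div[@class='product']"):
        rows.append({
            "title": text_or_none(item, "./span[@class='title']"),
            # Matching the exact class skips the old price row.
            "current_price": text_or_none(item, "./span[@class='price']"),
            "discount": text_or_none(item, "./span[@class='discount']"),
        })
    return rows


rows = extract_items(ITEMS)
```

Because every path starts at the current item, the same three expressions work for each product in the loop, whether or not it carries a discount.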
Step 6. All the selected content will appear in Data Fields. ➜ Click the “Field Name” to modify it.
Step 7. Click “Next” ➜ Click “Next” ➜ Click “Local Extraction” ➜ “OK” to run the task on your computer. Octoparse will automatically extract all the data selected.
Step 8. The extracted data will be shown in the "Data Extracted" pane. Click the "Export" button to export the results to an Excel file, a database, or another format and save the file to your computer.
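If you ever handle extracted rows in your own script instead of using the "Export" button, writing them out as CSV (which Excel can open) might look like the sketch below. The function name `rows_to_csv`, the field names, and the sample rows are hypothetical and only mirror the kind of fields selected in this tutorial.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)  # None values are written as empty cells
    return buffer.getvalue()


# Hypothetical rows, shaped like the fields selected in the steps above.
sample_rows = [
    {"title": "Shirt A", "current_price": "799", "discount": None},
    {"title": "Shirt B", "current_price": "699", "discount": "30% off"},
]
csv_text = rows_to_csv(sample_rows, ["title", "current_price", "discount"])
```

To save the result, write `csv_text` to a file opened with `newline=""` and an explicit encoding such as UTF-8, so that line endings and non-ASCII titles survive.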
Author: The Octoparse Team