Web Scraping Case Study | Scraping product information from Jabong.com

Wednesday, May 03, 2017 9:13 AM

 

Welcome to Octoparse web scraping case study! 

In this tutorial, I will show you how to scrape product information from the retail website Jabong.com (http://www.jabong.com). (Download my extraction task HERE in case you need it.)

 

List of features covered 

  • XPath Modification
  • Infinite Scroll
  • Loop List

 

Now, let's get started!

 

Step 1. Navigate to the target website

First, navigate to the website you would like to scrape from.

  

Step 2. Scroll down to load complete content

 

To make sure that we capture all products from the website, we'll need to set up pagination.

Unlike multi-page websites with an ordinary "next page" button, Jabong uses a technique called "infinite scrolling", meaning additional page content is loaded dynamically as the user approaches the bottom of the page.

This may seem a little tricky at first glance. Luckily, Octoparse can accommodate infinite scrolling by loading the page completely before the extraction step.

  • Navigate to "Go to Web Page" action
  • Go to "Advanced Options"
  • Fill in "Scroll times", "Time Interval" and "Scroll way"
  • Click "Save"

Here, I set the scroll times to 3, the time interval to 1 second, and selected "Scrolling down for one screen" as the scroll way.
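
To make the three settings concrete, here is a minimal Python sketch of the same idea: scroll one screen, wait, and repeat a fixed number of times. It is only an illustration and not how Octoparse works internally. The names scroll_to_load and FakePage are hypothetical, and the fake page assumes each scroll reveals 10 more products.

```python
import time
from typing import Callable


def scroll_to_load(scroll_one_screen: Callable[[], None],
                   scroll_times: int = 3,
                   interval_seconds: float = 1.0,
                   sleep: Callable[[float], None] = time.sleep) -> int:
    """Scroll down one screen at a time, pausing between scrolls.

    `scroll_one_screen` is a hypothetical callback that performs one scroll
    in whatever browser driver is in use. Returns the number of scrolls done.
    """
    done = 0
    for _ in range(scroll_times):
        scroll_one_screen()       # trigger loading of the next batch
        sleep(interval_seconds)   # give the page time to load it
        done += 1
    return done


class FakePage:
    """Stand-in for a real page: assumes each scroll reveals 10 more products."""

    def __init__(self) -> None:
        self.products_loaded = 10

    def scroll(self) -> None:
        self.products_loaded += 10


page = FakePage()
scrolls = scroll_to_load(page.scroll, scroll_times=3, interval_seconds=1.0,
                         sleep=lambda seconds: None)  # skip real waiting in the demo
print(scrolls, page.products_loaded)
```

If some products are still missing after the run, raising the scroll times or the time interval is the first thing to try.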

 

 

Step 3. Create a list of items

Now that we have finished configuring infinite scroll, we can move on to extracting the list information we want.

  • Identify the similar sections on the webpage and click anywhere on the first section.
  • When prompted, click "Create a list of items" (sections with similar layout)
  • Click "Add current item to the list"

Now that the first item has been added to the list, we need to finish adding all items to the list.

  • Click "Continue to edit the list"
  • Click another section with similar layout
  • Click "Add current item to the list" once again

Now all the sections have been added to the list.

  • Click "Finish Creating List"
  • Click "loop". This action tells Octoparse to click on each section in the list to extract the selected data

 

 

Step 4.  Modify XPath to locate loop items 

This website is a little tricky: as you can see, the first section repeats in the loop.

 

This is because the other part (in the pink box) shares the same XPath as the section we want to extract.

 

In this case, we need to modify the XPath to locate the sections accurately.

  • Copy the XPath to inspect it in FirePath
  • Find the exact XPath
  • Paste the exact XPath into the "Variable list" box.

     (Note: Click HERE to learn more about how to modify XPath)
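
To show why a more exact XPath helps, here is a small Python sketch using the standard library's xml.etree.ElementTree. The markup and the class names "promo" and "product" are made up for illustration and are not Jabong's real page structure. A broad XPath picks up the extra block, while a predicate on the class attribute keeps only the product sections.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: a promotional block uses the same tag as the products.
PAGE = """
<div id="catalog">
  <div class="promo"><span>Featured</span></div>
  <div class="product"><span>Shirt</span></div>
  <div class="product"><span>Jeans</span></div>
</div>
""".strip()

root = ET.fromstring(PAGE)

# Too broad: every child <div>, including the promo block.
broad = root.findall("./div")

# More exact: only <div> elements whose class attribute is "product".
exact = root.findall("./div[@class='product']")

print(len(broad), len(exact))
print([item.find("span").text for item in exact])
```

The same reasoning applies when inspecting the real page: look for an attribute that only the wanted sections carry and add it to the XPath as a predicate.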

 

 

Step 5. Select the data to be extracted

After the XPath has been modified, the list is now accurate and we can proceed to extract the data we need.

Looking at the different sections, we notice that the second item has discount information (-20%) while the first one doesn't. So, in order to extract the discount information, we'll use the second item to define the data fields to capture.

  • Select "Loop Action"
  • Click the second item under "Loop Item"
  • Click "Extract Data"
  • Now, locate the title of the product in the second section, click on it, and select "Extract" when prompted
  • Follow the same steps to extract the other data fields

 

 

Step 6. Modify the XPath to locate the data fields

After a brief inspection, we find that the auto-generated XPath for the current price does not always return accurate values, so we'll need to modify the XPath.

  • Select Field2 (the current price) and click "Customize Field"
  • Choose "Define ways to locate an item"
  • Paste the modified XPath into the "Relative XPath" box
  • Click "OK"
  • Click "Save"

 

If you don’t need to extract discount information, you can skip it and extract the price directly. 
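
The sketch below illustrates one plausible reason a position-based XPath can return the wrong price, and how a relative XPath tied to an attribute avoids it. Everything here is an assumption for illustration: the markup, the class names, and the FIELDS mapping are hypothetical, and the XPath Octoparse generates for the real page may look different.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the first product has no discount, the second does.
PAGE = """
<div id="catalog">
  <div class="product">
    <span class="title">Shirt</span>
    <span class="price">999</span>
  </div>
  <div class="product">
    <span class="title">Jeans</span>
    <span class="old-price">1500</span>
    <span class="price">1200</span>
    <span class="discount">-20%</span>
  </div>
</div>
""".strip()

# Relative XPaths, evaluated from each loop item rather than from the page root.
FIELDS = {
    "title": "./span[@class='title']",
    "price": "./span[@class='price']",
    "discount": "./span[@class='discount']",
}


def extract(item):
    """Return one row per loop item; a missing field becomes None."""
    row = {}
    for name, xpath in FIELDS.items():
        node = item.find(xpath)
        row[name] = node.text if node is not None else None
    return row


root = ET.fromstring(PAGE)
items = root.findall("./div[@class='product']")
rows = [extract(item) for item in items]

# A position-based XPath picks the old price when a discount shifts the layout.
positional_prices = [item.find("./span[2]").text for item in items]

print(rows)
print(positional_prices)
```

In this made-up layout the second span is the current price for the first item but the old price for the second, which is the kind of mismatch a modified relative XPath is meant to fix.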

 

 

 

Step 7. Rename data fields

Edit the field names directly to reflect the data extracted. 
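
If you prefer to rename fields after export instead, the same step can be sketched in Python as a simple key mapping. Only Field2 (current price) comes from this tutorial; the meanings of Field1 and Field3 and the new names are assumptions for illustration.

```python
# Hypothetical extracted row using auto-generated field names.
rows = [{"Field1": "Jeans", "Field2": "1200", "Field3": "-20%"}]

# Assumed mapping from auto-generated names to descriptive ones.
RENAMES = {"Field1": "title", "Field2": "current_price", "Field3": "discount"}

# Keys without a mapping are kept unchanged.
renamed = [{RENAMES.get(key, key): value for key, value in row.items()}
           for row in rows]

print(renamed)
```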

 

 

Step 8. Start running the task 

Now, we are done configuring the task. It's time to start running the task and get the data we want.

  • After saving the extraction configuration, click “Next”
  • Select “Local Extraction”
  • Click “OK” to run the task on your computer.

 

 

There are two ways to run a task: Local Extraction and Cloud Extraction (premium plan). With Local Extraction, the task runs on your own machine. With Cloud Extraction, the task runs on the Octoparse Cloud platform, so you can set it up to run, turn off your desktop or laptop, and the data will be extracted and saved to the cloud automatically. Features such as scheduled extraction, IP rotation, and the API are also supported with the Cloud. Find out more about Octoparse Cloud here.

 

Step 9. Check the data and export

The extracted data will be shown in the "Data Extracted" pane.

Click the "Export" button to export the results to an Excel file, a database, or other formats, and save the file to your computer.
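
For readers who want to post-process the results in code, here is a minimal sketch of writing rows like these to CSV with Python's standard csv module. The rows and column names are hypothetical sample data, not real output from Jabong.

```python
import csv
import io

# Hypothetical sample rows standing in for the extracted data.
rows = [
    {"title": "Shirt", "current_price": "999", "discount": ""},
    {"title": "Jeans", "current_price": "1200", "discount": "-20%"},
]


def to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer,
                            fieldnames=["title", "current_price", "discount"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(rows)
print(csv_text)
```

To write a file instead, the same writer can be pointed at open("products.csv", "w", newline=""), where the file name is only an example.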

 

Done!

                                                                                                               

Now that you have successfully captured product information from Jabong, check out more web scraping cases:

How to Scrape Websites With Infinite Scroll? (Quora, Facebook, Twitter)

Web Scraping Case Study | Scraping Data from Yelp

Web Crawling Case Study | Scraping ASTA with Pagination (2) - No "Next Button" Found

Scrape Scientific America with Pagination Issue

 

Or learn more about how data is extracted with Octoparse:

Scrape AJAX Pages from USA TODAY

Octoparse Cloud Service - Start your Cloud Extraction Now!

Web Scraping - Modify X Path For "Load More" Button with Octoparse

 

 

Author: The Octoparse Team
