Web Scraping Case Study | Scraping product information from Jabong.com
Wednesday, May 3, 2017 9:13 AM
Welcome to the Octoparse web scraping case study!
In this tutorial, I will show you how to scrape product information from the retail website Jabong.com (http://www.jabong.com). (Download my extraction task HERE in case you need it.)
List of features covered
- XPath Modification
- Infinite Scroll
- Loop List
Now, let's get started!
Step 1. Navigate to the target website
First, navigate to the website you would like to scrape from.
- Enter the target URL in the built-in browser (URL of the example: http://www.jabong.com/clothing/)
- Click the "Go" icon to open the webpage
Step 2. Scroll down to load complete content
To make sure we capture all products on the website, we need to set up pagination.
Unlike multi-page websites with an ordinary "next page" button, Jabong uses a technique called "infinite scrolling": additional content is loaded dynamically as the user approaches the bottom of the page.
This may seem tricky at first glance, but Octoparse can accommodate infinite scrolling by loading the page completely before the extraction step.
- Navigate to "Go to Web Page" action
- Go to "Advanced Options"
- Fill in "Scroll times", "Time Interval" and "Scroll way"
- Click "Save"
Here, I set the scroll times to 3 and the time interval to 1 second, and selected "Scrolling down for one screen" as the scroll way.
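For readers who want to see what these three settings amount to, here is a minimal Python sketch. It is not how Octoparse works internally; it only illustrates the idea of scrolling one screen at a time, a fixed number of times, with a pause in between. The function name scroll_to_load and the FakeDriver class are hypothetical. The driver argument is assumed to expose an execute_script method, as a Selenium WebDriver does.

```python
import time


def scroll_to_load(driver, scroll_times=3, interval_seconds=1.0):
    """Scroll down one screen at a time so lazily loaded items can appear.

    `driver` is assumed to expose execute_script(js), as a Selenium
    WebDriver does; any object with that method will work.
    Returns the number of scrolls performed.
    """
    for _ in range(scroll_times):
        # Move the viewport down by one screen height ("scroll way").
        driver.execute_script("window.scrollBy(0, window.innerHeight);")
        # Wait so the page can fetch and render the next batch ("time interval").
        time.sleep(interval_seconds)
    return scroll_times


class FakeDriver:
    """Hypothetical stand-in for a browser driver; it only records scripts."""

    def __init__(self):
        self.scripts = []

    def execute_script(self, script):
        self.scripts.append(script)


# Usage with the stand-in driver (interval set to 0 so it returns at once).
demo_driver = FakeDriver()
scroll_to_load(demo_driver, scroll_times=3, interval_seconds=0)
print(len(demo_driver.scripts))
```

If three scrolls do not reach the end of the product list, a larger scroll count or a longer interval would be the first things to try.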
Step 3. Create a list of items
Now that we have finished configuring infinite scroll, we can move on to extracting the list information we want.
- Identify the similar sections on the webpage and click anywhere on the first section.
- When prompted, click "Create a list of items" (sections with similar layout)
- Click "Add current item to the list"
Now that the first item has been added to the list, we need to add the remaining items.
- Click "Continue to edit the list"
- Click another section with similar layout
- Click "Add current item to the list" once again
Now all the sections have been added to the list.
- Click "Finish Creating List"
- Click "loop". This action tells Octoparse to click on each section in the list to extract the selected data
Step 4. Modify XPath to locate loop items
This website is a little tricky: as you can see, the first section repeats in the loop.
This is because another part of the page (marked with the pink box) shares the same XPath as the sections we want to extract.
In this case, we need to modify the XPath so that it locates the sections accurately.
- Copy the XPath to inspect it in FirePath
- Find the exact XPath
- And then paste the exact XPath in the "Variable list" box.
(Note: Click HERE to learn more about how to modify XPath)
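To illustrate why a shared XPath pulls in an unwanted section and how a narrower one fixes it, here is a small Python sketch using the standard library's xml.etree.ElementTree. The markup, class names and product titles are hypothetical and much simpler than Jabong's real page. Only the principle carries over.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: a banner sits next to the product
# sections and matches the same broad path.
PAGE = """
<body>
  <section>
    <div class="banner"><h3>Sponsored</h3></div>
    <div class="product"><h3>Blue Shirt</h3></div>
    <div class="product"><h3>Red Dress</h3></div>
  </section>
</body>
"""

root = ET.fromstring(PAGE)

# Broad XPath: every div under the section, including the banner.
broad = root.findall("./section/div")

# Narrowed XPath: only the divs whose class marks them as products.
narrow = root.findall("./section/div[@class='product']")

broad_titles = [div.findtext("h3") for div in broad]
narrow_titles = [div.findtext("h3") for div in narrow]

print(broad_titles)   # three entries, banner included
print(narrow_titles)  # two entries, products only
```

The same approach applies when testing an expression in FirePath: look for an attribute or position that only the wanted sections have, and add it to the XPath as a condition.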
Step 5. Select the data to be extracted
After the XPath has been modified, the list looks accurate and we can proceed to extract the data we need.
Looking at the different sections, we notice that the second item has discount information (-20%) while the first one doesn't. So, in order to extract the discount information, we'll use the second item to define the data fields to capture.
- Select "Loop Action"
- Click the second item under "Loop Item"
- Click "Extract Data"
- Now, locate the title of the product in the second section, click on it, and select "Extract" when prompted
- Follow the same steps to extract the other data fields
Step 6. Modify the XPath to locate the data info
After a brief inspection, we find that the auto-generated XPath for the current price does not always return accurate values, so we'll need to modify the XPath.
- Select Field2 for the current price and click "Customize Field" to open it
- Choose "Define ways to locate an item"
- And then paste the XPath in the "Relative XPath" box
- Click "OK"
- Click "Save"
If you don't need to extract discount information, you can skip this step and extract the price directly.
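The following Python sketch shows one plausible reason an auto-generated XPath can return the wrong price: if it relies on position, a discounted item with an extra "old price" element shifts everything by one. The markup, class names and price values are hypothetical. Only the -20% discount comes from the example above. Each XPath is evaluated relative to the loop item, which is what the "Relative XPath" box expects.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for two loop items; the second one is discounted.
ITEMS = """
<ul>
  <li class="item">
    <span class="title">Blue Shirt</span>
    <span class="price">500</span>
  </li>
  <li class="item">
    <span class="title">Red Dress</span>
    <span class="old-price">1000</span>
    <span class="price">800</span>
    <span class="discount">-20%</span>
  </li>
</ul>
"""

items = ET.fromstring(ITEMS).findall("./li")

# Position-based relative XPath: "the second span in the item".
# It picks up the old price for the discounted item.
positional = [item.findtext("./span[2]") for item in items]

# Attribute-based relative XPath: always the current price.
by_class = [item.findtext("./span[@class='price']") for item in items]

# Optional field: findtext returns None when the item has no discount.
discounts = [item.findtext("./span[@class='discount']") for item in items]

print(positional)
print(by_class)
print(discounts)
```

Anchoring the relative XPath on an attribute rather than on position keeps the field stable whether or not an item carries a discount.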
Step 7. Rename data fields
Edit the field names directly to reflect the data extracted.
Step 8. Start running the task
Now, we are done configuring the task. It's time to start running the task and get the data we want.
- After saving the extraction configuration, click “Next”
- Select “Local Extraction”
- Click “OK” to run the task on your computer.
There are two options: Local Extraction and Cloud Extraction (premium plan). With Local Extraction, the task runs on your own machine. With Cloud Extraction, the task runs on the Octoparse Cloud platform, so you can set it up, turn off your desktop or laptop, and the data will be extracted automatically and saved to the cloud. Features such as scheduled extraction, IP rotation and API access are also supported in the Cloud. Find out more about Octoparse Cloud here.
Step 9. Check the data and export
The extracted data will be shown in the "Data Extracted" pane.
Click the "Export" button to export the results to an Excel file, a database or another format, and save the file to your computer.
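If you later want to post-process the results outside Octoparse, the exported rows map naturally onto a CSV file. Below is a minimal Python sketch using the standard csv module. The field names and row values are hypothetical placeholders, not actual data from Jabong, and rows_to_csv is an illustrative helper, not part of Octoparse.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


# Hypothetical rows; a missing discount is stored as None and
# written out as an empty cell.
sample_rows = [
    {"title": "Blue Shirt", "price": "500", "discount": None},
    {"title": "Red Dress", "price": "800", "discount": "-20%"},
]

csv_text = rows_to_csv(sample_rows, ["title", "price", "discount"])
print(csv_text)
```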
Done!
You have now successfully captured product information from Jabong. Check out more web scraping cases:
How to Scrape Websites With Infinite Scroll? (Quora, Facebook, Twitter)
Web Scraping Case Study | Scraping Data from Yelp
Web Crawling Case Study | Scraping ASTA with Pagination (2) - No "Next Button" Found
Scrape Scientific America with Pagination Issue
Or learn more about how data is extracted with Octoparse:
Scrape AJAX Pages from USA TODAY
Octoparse Cloud Service - Start your Cloud Extraction Now!
Web Scraping - Modify X Path For "Load More" Button with Octoparse
Author: The Octoparse Team