Web Scraping Case Tutorial | Scrape product information from Flipkart.com
Monday, May 15, 2017 8:16 AM
Welcome to this Octoparse case tutorial. In this tutorial, we will walk through the detailed steps to scrape product information from the online store Flipkart.com.
Features covered:
- Modify XPath
- Set up Pagination
- Set AJAX Timeout
Now, let's get started!
Step 1. Set up basic information
- Click "Quick Start"
- Create a new task in the Advanced Mode
- Complete the basic information
Step 2. Navigate to the target website
- Enter the target URL in the built-in browser (URL of the example: https://www.flipkart.com/mobiles/samsung~brand/pr?sid=tyy,4io&otracker=nmenu_sub_Electronics_0_Samsung)
- Click the "Go" icon to open the webpage
Note: If the webpage keeps loading for a long time after the content has already appeared, click the "X" in the upper left-hand corner to stop the page from loading.
Step 3. Create a list of items
Once the webpage stops loading, we can see that the product items are arranged in sections with a very similar layout. That means we can build a loop list to add all of these product items to our extraction list.
- Click on the first section to extract
- Select "Create a list of items"
- Select "Add current item to the list"
The first item has been added to the list.
Note: You may notice that as soon as you hover the mouse over any product section, part or all of the section is highlighted dynamically, and the highlighted area adjusts as you move the mouse around.
This is an important observation, because the highlighted section is exactly the section you will be adding to the extraction list. To make sure the selected section includes the proper data and stays consistent across all items of the list, always check whether the outlined section includes the desired data. If it does not, use the "Expand" icon in the upper right-hand corner to adjust the selection manually.
Next, we need to finish adding all items to the list.
- Select "Continue to edit the list"
- Click on a second product section (one with similar layout)
- Select "Add current item to the list" once again
Now, you should see that all product items have been added to the list.
- Click "Finish Creating List"
- Select "Loop" - Octoparse is now configured to click and extract from each section of the list.
Step 4. Select the data to be extracted and rename data fields
Now we can define the data fields to extract. Note that when you define the "Extract Data" action, you are creating an extraction template that is applied to the rest of the sections on the list.
- To capture the product version, click on "Samsung Galaxy On5 (Gold, 8GB)"
- Select "Extract text"
Notice that the data field has been captured and added to the customization pane. Rename the field to "Version".
- Repeat the same steps to extract any other data fields you need.
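To make the idea of a loop list plus an extraction template more concrete, here is a minimal Python sketch. It is not how Octoparse works internally. The markup, the class names, the "Price" field and the second product name are all made up for illustration; only the "Version" field name and the first product name come from this tutorial.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for a product listing page.
# The price values are placeholders, not real data.
PAGE = """<div id="results">
  <div class="product">
    <span class="name">Samsung Galaxy On5 (Gold, 8GB)</span>
    <span class="price">PRICE-1</span>
  </div>
  <div class="product">
    <span class="name">Example Phone B</span>
    <span class="price">PRICE-2</span>
  </div>
</div>"""

# The "template": field name -> XPath relative to one product section.
TEMPLATE = {
    "Version": "./span[@class='name']",
    "Price": "./span[@class='price']",
}


def extract_items(page_xml, item_xpath, template):
    """Apply the same field template to every section matched by item_xpath."""
    root = ET.fromstring(page_xml)
    rows = []
    for item in root.findall(item_xpath):  # the "loop list"
        row = {}
        for field, xpath in template.items():
            node = item.find(xpath)
            # Missing fields become empty strings instead of raising an error.
            row[field] = node.text.strip() if node is not None and node.text else ""
        rows.append(row)
    return rows


rows = extract_items(PAGE, ".//div[@class='product']", TEMPLATE)
for row in rows:
    print(row)
```

The template is defined once against a single section and then reused for every item in the list, which is why it matters that each selected section has a consistent layout.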
Step 5. Set up pagination
The extraction setup for a single page is done, but there are multiple pages to capture. We therefore need to add a pagination action that applies the same extraction action to the other pages.
- Click on "Next" located to the right of page numbers
- When prompted, choose "Loop Click Next Page"
This tells Octoparse to click through to each page and repeat the extraction actions there.
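Conceptually, a "Loop Click Next Page" action behaves like the loop sketched below: extract from the current page, follow "Next", and stop when there is no "Next" left. This is only an illustration in Python. The in-memory SITE dictionary, the function name and the max_pages safety limit are assumptions made for the example, not Octoparse settings.

```python
# Hypothetical in-memory "site": page key -> rows on that page and the page
# that its "Next" button leads to (None means there is no "Next" button).
SITE = {
    "page1": {"rows": ["item A", "item B"], "next": "page2"},
    "page2": {"rows": ["item C"], "next": "page3"},
    "page3": {"rows": ["item D"], "next": None},
}


def loop_click_next(site, start, max_pages=100):
    """Collect rows page by page until no "Next" page remains."""
    collected = []
    current = start
    visited = 0
    # max_pages guards against an endless loop if "Next" never disappears.
    while current is not None and visited < max_pages:
        page = site[current]
        collected.extend(page["rows"])  # run the extraction on this page
        current = page["next"]          # "click" Next
        visited += 1
    return collected


all_rows = loop_click_next(SITE, "page1")
print(all_rows)
```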
Step 6. Modify XPath to locate the "Next" button
Every time we set up a pagination action, Octoparse autogenerates an XPath. For this website, however, the autogenerated XPath does not always locate the "Next" button accurately. To solve this, we need to modify the XPath manually using Firepath, an XPath tool for Firefox.
- Inspect the "Next" button in Firepath
- Locate and modify the XPath of "Next" button
- Back in Octoparse, navigate to the "Cycle Pages" action
- Copy the modified XPath .//span[text()='Next'] from Firepath, and paste it in the "Single Element" Text Box
- Click "Save"
To learn more about XPath modification, refer to http://www.octoparse.com/tutorial/getting-started-with-xpath-1/.
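To see what an expression like .//span[text()='Next'] selects, the following Python sketch runs an equivalent query against a small, made-up pager. The markup is hypothetical and far simpler than the real page. Python's standard xml.etree.ElementTree module does not support text(), so the sketch uses its closest equivalent, [.='Next'], which requires Python 3.7 or later.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified pager markup; the real page is more complex.
PAGER = """<nav>
  <a href="?page=1"><span>1</span></a>
  <a href="?page=2"><span>2</span></a>
  <a href="?page=2"><span>Next</span></a>
</nav>"""


def find_next_button(markup):
    """Return every <span> whose text is exactly 'Next'."""
    root = ET.fromstring(markup)
    # ElementTree equivalent of the XPath .//span[text()='Next']
    return root.findall(".//span[.='Next']")


matches = find_next_button(PAGER)
print(len(matches), matches[0].text)
```

Matching on the visible text rather than on position is what keeps the selection stable: the page-number spans change from page to page, but only one span reads "Next".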
Step 7. Set AJAX Timeout
Because this web page uses AJAX for pagination, we need to set an AJAX timeout for the "Click to Paginate" action.
- Navigate to the "Click to Paginate" action
- Check "AJAX Load"
- Set the AJAX timeout to 2 seconds
- Click "Save"
- Click "Save"
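Because an AJAX click does not trigger a full page reload, a scraper has no "page loaded" event to rely on and needs an upper limit on how long it waits for the new content. The Python sketch below shows one common way to express such a limit: poll a condition until it is true or the timeout expires. It is an assumption-based illustration. The function and parameter names are hypothetical, and Octoparse's own handling of the AJAX timeout may differ (for example, it may simply wait for the configured time).

```python
import time


def wait_for_content(is_loaded, timeout=2.0, poll_interval=0.1):
    """Poll is_loaded() until it returns True or `timeout` seconds pass.

    Returns True if the content appeared in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_loaded():
            return True
        time.sleep(poll_interval)
    # One last check at the deadline before giving up.
    return bool(is_loaded())


# Usage with a stand-in condition: the "content" appears on the third check.
checks = {"count": 0}


def fake_is_loaded():
    checks["count"] += 1
    return checks["count"] >= 3


loaded = wait_for_content(fake_is_loaded, timeout=2.0, poll_interval=0.01)
print(loaded)
```

If the timeout is too short, the next action runs before the new products have appeared. If it is too long, every page costs extra time.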
Step 8. Start running your task
Now that the task is configured, it's time to run it and get the data we want.
- Select "Local Extraction"
- Click "OK" to start
Once the task starts running, data will be shown under the browser as it gets captured from the webpage.
There are two extraction modes: Local Extraction and Cloud Extraction (premium plan). With Local Extraction, the task runs on your own machine. With Cloud Extraction, the task runs on the Octoparse Cloud platform, so you can set it up to run, turn off your desktop or laptop, and the data will be extracted automatically and saved to the cloud. Features such as scheduled extraction, IP rotation, and API access are also supported in the Cloud.
Step 9. Check the data and export
Once the extraction completes, click the "Export" button to export the results to XLS, CSV, a database, or another format of your choice.
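For readers who want to post-process the results themselves, the sketch below shows how rows like the ones extracted in this tutorial could be written as CSV with Python's standard csv module. It is independent of Octoparse's own export feature. The "Price" field and its "N/A" value are placeholders made up for the example.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# "N/A" is a placeholder value, not scraped data.
sample_rows = [{"Version": "Samsung Galaxy On5 (Gold, 8GB)", "Price": "N/A"}]
csv_text = rows_to_csv(sample_rows, ["Version", "Price"])
print(csv_text)
```

Note that the product name contains a comma, so the csv module wraps it in quotes. This is one reason to use a CSV library instead of joining fields by hand.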
The Octoparse Team