Web Scraping Case Tutorial | Scrape product information from Flipkart.com
Monday, May 15, 2017 8:16 AM
Welcome to the Octoparse case tutorial. In this tutorial, we will walk through the detailed steps to scrape product information from the online store Flipkart.com.
List features covered
Now, let's get started!
Step 1. Set up basic information
Step 2. Navigate to the target website
Note: If the webpage keeps loading for a long time even though the content has already shown up, click the "X" at the upper left-hand corner to stop the page from loading.
Step 3. Create a list of items
Once the webpage stops loading, we can see that the product items are arranged in sections with very similar layouts. That means we can build a loop list that adds all of these product items to our extraction list.
The first item has now been added to the list.
Note: As soon as you hover the mouse over a product section, part or all of the section is highlighted dynamically, and the highlighted area adjusts as you move the mouse around.
This is an important observation, because the highlighted section is exactly the section you will be adding to the extraction list. To make sure the selected section includes the proper data and stays consistent across all items in the list, always check whether the outlined section includes the desired data. If it does not, use the "Expand" icon at the upper right-hand corner to manually adjust the selection.
Next, we need to finish adding all items to the list.
Now, you should see that all product items have been added to the list.
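To make the idea of a loop list more concrete, here is a minimal sketch in Python, outside of Octoparse. It assumes a simplified, hypothetical page in which every product sits in a `<div class="product">`; the real Flipkart markup is different and is not reproduced here. The point is only that one path expression can match every section that shares the same layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup (NOT Flipkart's real HTML):
# every product sits in a section with the same structure.
PAGE = """
<html><body>
  <div class="banner">Sale</div>
  <div class="product"><a class="title">Phone A</a><span class="price">9999</span></div>
  <div class="product"><a class="title">Phone B</a><span class="price">12999</span></div>
  <div class="product"><a class="title">Phone C</a><span class="price">7499</span></div>
</body></html>
"""

root = ET.fromstring(PAGE)

# One expression that matches every similarly structured section.
# The banner is skipped because it does not share the "product" layout.
items = root.findall(".//div[@class='product']")

for index, item in enumerate(items, start=1):
    print(index, item.find("./a").text)
```

If the expression matched too little (for example, only the title link instead of the whole section), the price would fall outside each item, which is why checking the highlighted section matters.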
Step 4. Select the data to be extracted and rename data fields
Now we can define the data fields to extract. Note that when you define the "Extract Data" action, you are creating an extraction template that is applied to the rest of the sections in the list.
Notice that the data field has been captured and added to the customization pane. Rename the field to "Version".
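The "extraction template" idea can be sketched in Python as a mapping from field names to paths that are relative to one item. This is only an illustration under assumed markup: the `<li class="item">` structure, the `Price` field, and the `extract` helper are hypothetical, and only the field name "Version" comes from the step above.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for two list items with the same layout.
PAGE = """<ul>
  <li class="item"><h3>Phone A</h3><span class="price">9999</span></li>
  <li class="item"><h3>Phone B</h3><span class="price">12999</span></li>
</ul>"""

# The template: field name -> path relative to a single item.
# It is defined once, on one item, and reused for all the others.
TEMPLATE = {
    "Version": "./h3",
    "Price": "./span[@class='price']",
}

def extract(item, template):
    """Apply the same relative paths to one item and return a row."""
    row = {}
    for field, path in template.items():
        node = item.find(path)
        # A missing element yields an empty field instead of an error.
        row[field] = node.text.strip() if node is not None and node.text else ""
    return row

rows = [extract(item, TEMPLATE)
        for item in ET.fromstring(PAGE).findall("./li[@class='item']")]
print(rows)
```

Because the paths are relative to each item, the template only works if every item in the list was selected with the same boundaries.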
Step 5. Set up pagination
We are done setting up extraction for a single page, but there are multiple pages to capture. Hence, we need to add a pagination action that applies the same extraction action to the other pages.
This tells Octoparse to click through to each page and repeat the extraction actions there.
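Conceptually, a pagination action is a loop: extract the current page, follow "Next", and stop when there is no next page. The sketch below shows that loop in Python over a hypothetical in-memory "site" (the `PAGES` dictionary and the `scrape_all` function are made up for illustration); it does not fetch anything from Flipkart.

```python
# Hypothetical in-memory site:
# page id -> (items on that page, id of the next page or None)
PAGES = {
    "page1": (["Phone A", "Phone B"], "page2"),
    "page2": (["Phone C", "Phone D"], "page3"),
    "page3": (["Phone E"], None),
}

def scrape_all(start, pages, max_pages=100):
    """Repeat the same extraction on every page reachable via 'next'."""
    results = []
    current = start
    visited = 0
    # max_pages is a safety limit so a broken "next" link cannot loop forever.
    while current is not None and visited < max_pages:
        items, next_page = pages[current]
        results.extend(items)   # same extraction action on each page
        current = next_page     # "click" the Next button
        visited += 1
    return results

print(scrape_all("page1", PAGES))
```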
Step 6. Modify XPath to locate the "Next" button
Every time we set up a pagination action, Octoparse autogenerates an XPath. For this website, however, the autogenerated XPath does not always locate the "Next" button accurately. To solve this, we need to modify the XPath manually using an XPath tool in Firefox.
To learn more about XPath modification, refer to http://www.octoparse.com/tutorial/getting-started-with-xpath-1/.
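One common reason a generated XPath misses the "Next" button is that it depends on position, and the position shifts once a "Previous" link appears on later pages. The sketch below illustrates this with two hypothetical pagination bars (not Flipkart's real markup) and Python's standard library, which supports only a subset of XPath. In a full XPath tool, a text-based expression such as `//a[contains(., 'Next')]` expresses the same idea.

```python
import xml.etree.ElementTree as ET

# Hypothetical pagination bar on the first page: no "Previous" link yet.
PAGE_1 = """<nav>
  <a href="?page=1">1</a><a href="?page=2">2</a>
  <a href="?page=2"><span>Next</span></a>
</nav>"""

# Hypothetical pagination bar on the second page: "Previous" shifts everything.
PAGE_2 = """<nav>
  <a href="?page=1"><span>Previous</span></a>
  <a href="?page=1">1</a><a href="?page=2">2</a><a href="?page=3">3</a>
  <a href="?page=3"><span>Next</span></a>
</nav>"""

POSITIONAL = "./a[3]"         # "the third link": fragile
BY_TEXT = "./a[span='Next']"  # "the link whose span reads Next": stable

def href(page, xpath):
    """Return the href of the first element matched by xpath, or None."""
    node = ET.fromstring(page).find(xpath)
    return node.get("href") if node is not None else None

for name, page in (("page 1", PAGE_1), ("page 2", PAGE_2)):
    print(name, "positional:", href(page, POSITIONAL), "by text:", href(page, BY_TEXT))
```

On the first page both expressions find the same link, but on the second page the positional one lands on the link to the current page, so the task would never advance.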
Step 7. Set AJAX Timeout
As this web page uses AJAX for pagination, we need to set an AJAX timeout for the "Click to Paginate" action.
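Why a timeout is needed: with AJAX, clicking "Next" does not reload the page, so there is no page-load event to wait for, and the scraper has to wait a bounded amount of time for the new content to arrive. The sketch below shows that general idea in Python. It is not a description of how Octoparse implements its AJAX timeout; `wait_for_change` and `FakePage` are hypothetical names used only for this illustration.

```python
import time

def wait_for_change(read_content, previous, timeout=5.0, poll_interval=0.1):
    """Poll read_content() until it differs from previous.

    Returns the new content, or None if timeout seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        current = read_content()
        if current != previous:
            return current
        time.sleep(poll_interval)
    return None

class FakePage:
    """Stand-in for a page whose content updates after a few reads."""
    def __init__(self, old, new, ready_after):
        self.old = old
        self.new = new
        self.ready_after = ready_after
        self.calls = 0

    def read(self):
        self.calls += 1
        return self.new if self.calls > self.ready_after else self.old

page = FakePage("results for page 1", "results for page 2", ready_after=3)
print(wait_for_change(page.read, "results for page 1", timeout=2.0, poll_interval=0.01))
```

A timeout that is too short risks extracting the old page again; one that is too long only slows the task down.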
Step 8. Start running your task
Now that we are done configuring the task, it's time to run it and get the data we want.
Once the task starts running, data will be shown under the browser as it gets captured from the webpage.
There are two extraction modes: Local Extraction and Cloud Extraction (premium plan). With a local extraction, the task runs on your own machine. With a cloud extraction, the task runs on the Octoparse Cloud platform, so you can set it up, turn off your desktop or laptop, and the data will be automatically extracted and saved to the cloud. Features such as scheduled extraction, IP rotation, and the API are also supported in the Cloud. Find out more about Octoparse Cloud here.
Step 9. Check the data and export
Once the extraction completes, click the "Export" button to export the results to XLS, CSV, a database, or another format of your choice.
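For readers who want to check an exported file programmatically, the sketch below shows what CSV output for extracted rows looks like, using Python's standard `csv` module. The sample rows, the `Price` column, and the `to_csv` helper are hypothetical; Octoparse's own export may order or quote columns differently.

```python
import csv
import io

# Hypothetical extracted rows; "Version" is the field renamed in Step 4.
rows = [
    {"Version": "Phone A", "Price": "9999"},
    {"Version": "Phone B", "Price": "12999"},
]

def to_csv(rows, fieldnames):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

print(to_csv(rows, ["Version", "Price"]))
```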
The Octoparse Team