Web Crawling Case Study | Scraping data from Justdial.com
Tuesday, May 09, 2017 11:08 AM
Welcome to the Octoparse Web Crawling Case Study. In this tutorial, we will go through the detailed steps to scrape data from justdial.com. For this example, our goal is to capture the key product information for the TV listings on justdial.com.
List of features covered
- Building a loop list
- Scrolling down
- Data extraction
Now, let's get started!
Step 1. Set up basic information
- Click "Quick Start"
- Create a new task in the Advanced Mode
- Complete the basic information
Step 2. Navigate to the target website
- Enter the target URL in the built-in browser (URL of the example: https://www.justdial.com/Shop-Online/Sony-TV/nid-10503460?brand-1=Sony&city=Mumbai&sort=pop_desc)
- Click the "Go" icon to open the webpage
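The example URL carries the search filters in its query string. As an illustrative aside outside the Octoparse workflow, here is a minimal Python sketch that splits the URL apart with the standard library so the applied filters are easy to see. The variable names are chosen for this example only.

```python
from urllib.parse import urlsplit, parse_qs

# The example URL from Step 2.
url = ("https://www.justdial.com/Shop-Online/Sony-TV/nid-10503460"
       "?brand-1=Sony&city=Mumbai&sort=pop_desc")

parts = urlsplit(url)

# parse_qs maps each query key to a list of values; every key appears
# once in this URL, so we keep only the first value.
filters = {key: values[0] for key, values in parse_qs(parts.query).items()}

print(parts.netloc)  # host name
print(parts.path)    # path of the listing page
print(filters)       # brand, city and sort order
```

Changing the brand or city in the URL before pasting it into the built-in browser is one way to point the same task at a different listing, assuming the site keeps this URL scheme.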
Step 3. Create a list of items
As soon as the webpage has loaded, we'll notice that all the TV listings are arranged in similar sections. Hence, we can build a list of items to scrape from.
- Click anywhere on the first section
- When prompted, click "Create a list of items" (sections with similar layout)
- Click "Add current item to the list"
Now that the first item has been added to the list, we need to finish adding the remaining items.
- Click “Continue to edit the list”
- Click a second section with similar layout
- Click “Add current item to the list” again
All of the similar sections should now have been added automatically. At this point, select "finish creating the list", then click "loop".
Note: If the section was not identified correctly in the first place, click "Expand the selection area" until the section we need is outlined properly.
Step 4. Select the data to be extracted
When we select "Loop" in the last step, Octoparse automatically selects the first item of the list and clicks it open. Note that the extraction action we set up for this page will be applied to the rest of the list. So, let's look through the page to spot the data we want and capture it. Say we would like to capture the product title, rating and score.
- Click on the title of the product
- Select “Extract text”. Notice how the title of the product has been captured into Field1.
- Follow the same steps to extract the other data points
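Conceptually, Steps 3 and 4 amount to "find every section with the same layout, then read the same fields from each one". The Python sketch below illustrates that idea only. It does not show how Octoparse works internally, and the markup, class names and values are invented for the example; the real justdial.com page is structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup. Each <div class="item"> stands for one
# listing section with a similar layout.
SAMPLE = """
<div id="results">
  <div class="item"><h2>TV A</h2><span class="rating">4.2</span><span class="score">120</span></div>
  <div class="item"><h2>TV B</h2><span class="rating">3.9</span><span class="score">87</span></div>
</div>
"""

def extract_items(markup):
    """Loop over the similar sections and pull the same fields from each."""
    root = ET.fromstring(markup.strip())
    rows = []
    for section in root.findall(".//div[@class='item']"):  # the "loop list"
        rows.append({
            "title": section.findtext("h2"),
            "rating": section.findtext("span[@class='rating']"),
            "score": section.findtext("span[@class='score']"),
        })
    return rows

rows = extract_items(SAMPLE)
for row in rows:
    print(row)
```

The extraction rule is written once and applied to every section, which is why the fields set up for the first item carry over to the rest of the list.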
Step 5. Rename the data fields
- Rename any data fields if necessary
- Don't forget to click "Save"
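If you post-process exported data yourself, the same renaming can be expressed as a simple key mapping. This is a minimal sketch with invented values. Only "Field1" appears in this tutorial; the names "Field2" and "Field3" are an assumption modeled on it.

```python
# Hypothetical extracted row using default field names.
rows = [
    {"Field1": "TV A", "Field2": "4.2", "Field3": "120"},
]

# Mapping from default names to descriptive names.
RENAMES = {"Field1": "title", "Field2": "rating", "Field3": "score"}

def rename_fields(row, renames):
    """Return a copy of row with keys renamed; unmapped keys stay as they are."""
    return {renames.get(key, key): value for key, value in row.items()}

renamed = [rename_fields(row, RENAMES) for row in rows]
print(renamed)
```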
Step 6. Scroll down to load complete content
Just as with other multi-page websites, after we finish building the list, we'll need to configure the task for pagination (page flipping) to enable multi-page scraping. Instead of using an ordinary "next page" button, justdial uses a technique called "infinite scrolling", meaning additional webpage content is loaded dynamically as the user approaches the bottom of the page.
This may seem a little tricky at first glance. Luckily, Octoparse can easily accommodate infinite scrolling by loading the page completely before the extraction step.
- Navigate to the "Go to Web Page" action
- Go to "Advanced Options"
- Fill in "Scroll times", "Time Interval" and "Scroll way"
- Click "Save" then "Next"
Here, I set the scroll times to 8, the time interval to 2 seconds, and select "Scrolling to bottom of the page" as the scroll way.
I chose these numbers because I had tested the webpage beforehand and found that scrolling 8 times with 2 seconds between scrolls loads the whole list. Keep in mind that for other sites, you will need to test the page to find suitable values.
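The scroll settings describe a simple loop: scroll to the bottom, wait, and repeat a fixed number of times. The Python sketch below illustrates that loop and is not Octoparse's implementation. `FakePage` and its numbers (10 items per scroll, 75 items in total) are invented stand-ins for a real browser page, and the `sleep` parameter is injected so the demo can record the waits instead of actually pausing.

```python
import time

def scroll_until_loaded(scroll_to_bottom, times=8, interval=2.0, sleep=time.sleep):
    """Call scroll_to_bottom `times` times, pausing `interval` seconds after each."""
    for _ in range(times):
        scroll_to_bottom()
        sleep(interval)  # give the page time to load more items

class FakePage:
    """Invented stand-in for a page: each scroll loads 10 more items, up to 75."""
    def __init__(self):
        self.loaded = 10

    def scroll_to_bottom(self):
        self.loaded = min(self.loaded + 10, 75)

page = FakePage()
waits = []  # records each pause instead of sleeping
scroll_until_loaded(page.scroll_to_bottom, times=8, interval=2, sleep=waits.append)
print(page.loaded, sum(waits))  # prints: 75 16
```

With 8 scrolls and a 2-second interval, the task spends at least 16 seconds loading the page before extraction starts. If testing shows the list is still incomplete, raise the scroll count; if the page loads slowly, raise the interval.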
Step 7. Start running your task
Now that we are done configuring the task, it's time to run it to get the data we want.
- Select "Local Extraction"
- Click “OK” to start
There are two options: Local Extraction and Cloud Extraction (premium plan). With a local extraction, the task runs on your own machine. With a cloud extraction, the task runs on the Octoparse Cloud platform, so you can set it up, turn off your desktop or laptop, and the data will be extracted and saved to the cloud automatically. Features such as scheduled extraction, IP rotation and API are also supported with the Cloud. Find out more about Octoparse Cloud here.
Step 8. Export Data
As soon as the task starts to run, notice how the webpage scrolls down automatically to load additional content.
Data will be added to the pane below the browser as it gets extracted from the webpage.
After the task is done, click to export the extracted data to a file format of your choice (csv, xls, etc.) or to a database.
Now you have learned how to crawl data from justdial. Get started with your own crawling task, or download the task for this example here to see it for yourself.
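For readers who want to see what a csv export of such data looks like, here is a minimal Python sketch using the standard `csv` module. The rows and column names are invented for illustration and are not output from the actual task.

```python
import csv
import io

# Hypothetical extracted rows.
rows = [
    {"title": "TV A", "rating": "4.2", "score": "120"},
    {"title": "TV B", "rating": "3.9", "score": "87"},
]

def rows_to_csv(rows):
    """Serialize a list of dicts to CSV text, using the first row's keys as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = rows_to_csv(rows)
print(csv_text)
```

To write to disk instead, the same writer can be pointed at a file opened with `open("output.csv", "w", newline="")`, where the file name is only an example.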
Author: The Octoparse Team
For more information about Octoparse, please click here.