Web Crawling Case Study | Crawling Data from Otlob.com

Thursday, May 11, 2017 10:29 AM

 

Welcome to this Octoparse web scraping case study!

In this tutorial, we will walk through the detailed steps for crawling data from otlob.com.

 

List features covered 

 

Now, let's get started!

 

Step 1. Set up basic information

  • Click "Quick Start"
  • Create a new task in the Advanced Mode
  • Complete the basic information

 

 

Step 2. Navigate to the target website

 

Step 3. Create a list of items

Once the webpage finishes loading, notice that the menu items are arranged in sections with a very similar layout. That means we can build a loop list containing all of these similar sections and configure Octoparse to extract data from each section later.

  • Move your cursor over the menu items with a similar layout, from which you want to extract data.
  • Click anywhere on the first section of the web page.
  • Click "Expand the selection area" until the outlined box includes all the content you want.
  • Click "Create a list of items"
  • Click "Add current item to the list"

Note: "Expand the selection area" is usually needed when the initial selection does not cover the intended element.

 

Now that the first item has been added to the list, we need to add the remaining items.

  • Click "Continue to edit the list"
  • Click a second section with similar layout
  • Click "Add current item to the list" again

Octoparse then adds some of the matching sections to the list automatically.

  • Click "Finish Creating List"
  • Click "loop", which tells Octoparse to click each section in the list and extract the selected data

 

 

Step 4. Modify XPath to build a complete loop list

After creating the loop list in Step 3, we notice that only some of the menu items were added to it. We therefore need to modify the XPath of the variable list so that it matches all of the menu items.

  • Copy the XPath generated by Octoparse into FirePath.
  • Modify the XPath until it matches all of the loop items.
  • Paste the modified XPath, .//*[@id='menu']/div/div[3]/section/article/article, into the Variable list.
  • Click "Save"

Now, we can see that all of the items are fetched by Octoparse.
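
To see why an XPath can match only part of the menu, here is a minimal Python sketch run against a made-up page. The markup, the product names, and the "narrow" XPath are assumptions for illustration only; the real otlob.com page and the XPath Octoparse generates for it will differ. Only the modified XPath is taken from this tutorial.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the menu markup; the real page will differ.
SAMPLE_PAGE = """
<html><body>
  <div id="menu">
    <div>
      <div>header</div>
      <div>filters</div>
      <div>
        <section>
          <article>
            <article><h4>Falafel</h4><span>10.00</span></article>
            <article><h4>Hummus</h4><span>12.50</span></article>
          </article>
        </section>
        <section>
          <article>
            <article><h4>Koshari</h4><span>20.00</span></article>
          </article>
        </section>
      </div>
    </div>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE_PAGE.strip())

# Assumed example of a too-specific XPath: the index on "section"
# pins the match to the first section only.
NARROW_XPATH = ".//*[@id='menu']/div/div[3]/section[1]/article/article"

# The modified XPath from this tutorial: no index on "section",
# so items in every section are matched.
FULL_XPATH = ".//*[@id='menu']/div/div[3]/section/article/article"

partial_items = root.findall(NARROW_XPATH)
all_items = root.findall(FULL_XPATH)

print(len(partial_items), "items matched by the narrow XPath")
print(len(all_items), "items matched by the modified XPath")
```

Removing an index such as [1] from a step is a common way to widen an XPath from one section to all sections with the same structure.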

 

 

Step 5. Select the data to be extracted and rename data fields

Now, we want to extract data from these similar menu items.

Go back to the "Extract Data" action and click it. The first menu item section is now outlined with a dotted box. That means we set up the extraction on this first section by following the steps below. Note that the extraction action we set up for the first section will apply to every other item in the loop list.

  • Click the product name.
  • Select "Extract text"
  • Follow the same steps to extract other data.
  • Rename any field names if necessary.
  • Click "Save"
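
The idea that fields chosen on the first item apply to every item in the loop can be sketched in Python as follows. This is only an illustration, not how Octoparse works internally: the sample markup, the h4 and span tags, and the field names product_name and price are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical menu markup, reduced to what the sketch needs.
SAMPLE_PAGE = """
<html><body>
  <div id="menu"><div>
    <div>header</div>
    <div>filters</div>
    <div>
      <section><article>
        <article><h4>Falafel</h4><span>10.00</span></article>
        <article><h4>Hummus</h4><span>12.50</span></article>
      </article></section>
    </div>
  </div></div>
</body></html>
"""

# Loop XPath from Step 4 of this tutorial.
LOOP_XPATH = ".//*[@id='menu']/div/div[3]/section/article/article"

# Field name -> path relative to one loop item (assumed tags).
FIELDS = {"product_name": "h4", "price": "span"}


def extract_rows(page_xml):
    """Apply the same relative field paths to every item in the loop."""
    root = ET.fromstring(page_xml.strip())
    rows = []
    for item in root.findall(LOOP_XPATH):
        row = {}
        for field_name, relative_path in FIELDS.items():
            # findtext returns None when the element is missing.
            row[field_name] = (item.findtext(relative_path) or "").strip()
        rows.append(row)
    return rows


rows = extract_rows(SAMPLE_PAGE)
for row in rows:
    print(row)
```

Because the field paths are relative to each loop item, defining them once is enough; renaming a field only changes the key in FIELDS.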

 

Step 6. Start running your task 

Now that the task is configured, it's time to run it and get the data we want.

  • Click "Next"
  • Click "Next"
  • Select "Local Extraction"

 

Octoparse offers Local Extraction and Cloud Extraction (premium plan). With Local Extraction, the task runs on your own machine. With Cloud Extraction, the task runs on the Octoparse Cloud Platform, which means you can set it up, turn off your desktop or laptop, and the data will be extracted and saved to the cloud automatically. Features such as scheduled extraction, IP rotation, and the API are also supported with Cloud Extraction. Find out more about Octoparse Cloud here.

 

 

Step 7. Check and export the data 

When the task is completed, you can click the export button to export your data results.

  • Check the data extracted
  • Click the "Export" button to export the results to an Excel file, a database, or another format, and save the file to your computer
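
For readers who want to picture the exported result, the following sketch writes a few rows to CSV text with Python's standard csv module. It does not reproduce Octoparse's own export; the rows and the field names product_name and price are made-up examples.

```python
import csv
import io

# Made-up rows standing in for the extracted menu items.
rows = [
    {"product_name": "Falafel", "price": "10.00"},
    {"product_name": "Hummus", "price": "12.50"},
]


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows, ["product_name", "price"])
print(csv_text)
```

Each extracted item becomes one row and each renamed data field becomes one column, which is the same flat shape an Excel export would have.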

 

 

 

Done!

 

To learn more about related topics, you can check out the tutorials below:

How to Use XPath in Octoparse

Modify XPath Manually in Octoparse

Automatically Scrape Dynamic Websites(Example:Twitter)

Web Scraping - Modify X Path For "Load More" Button with Octoparse

 

We have also picked some case studies you may be interested in:

Web Scraping Case Study | Scraping Data from Yelp

Scrape Emails from Facebook Pages

Octoparse Smart Mode -- Get Data in Seconds

 

 

Author: The Octoparse Team
