Web Scraping Case Study | Crawling data from Rakuten Global Market

Wednesday, May 10, 2017 12:46 PM

In this tutorial, we will walk through the detailed steps to crawl data from a retail website, rakuten.com.

 

List of features covered 

  • Set up pagination
  • Build a loop list
  • Modify XPath

 

Now, let's get started!

 

Step 1. Set up basic information

  • Click "Quick Start"
  • Create a new task in the Advanced Mode
  • Complete the basic information

 

 

Step 2. Navigate to the target website

 

 

Step 3. Create a list of items

Move your cursor over the products that share a similar layout; these are the sections you will extract data from.

  • Click anywhere in the first section on the web page
  • When prompted, click “Create a list of items” (sections with similar layout)
  • Click “Add current item to the list”

 

Now that the first item has been added to the list, we need to finish adding the remaining items.

  • Click “Continue to edit the list”
  • Click a second section with similar layout
  • Click “Add current item to the list” again

Now all the sections with similar layout have been added to the list.

  • Click “Finish Creating List”
  • Click “loop”. This action tells Octoparse to click on each section in the list and extract the selected data

 

Note: If the selection was not identified properly in the first place, click “Expand the selection area” until the outlined box includes all the content you want to crawl.

 

 

Step 4. Select the data to be extracted and rename data fields

  • Click the data field “Product Description”
  • Select “Extract text”
  • Follow the same steps to extract the other data fields
  • Rename any field names if necessary
  • Click "Save"
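
For readers who want to see what Steps 3 and 4 amount to outside the point-and-click interface, here is a minimal sketch in Python. It is not how Octoparse works internally; the sample markup, the class names "item", "name" and "price", and the function extract_items are all hypothetical and chosen only for illustration. It loops over every section with the same layout and extracts the text of each field.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for a product listing page.
SAMPLE_PAGE = """
<div id="contents">
  <div class="item"><a class="name">Green Tea</a><span class="price">$12.00</span></div>
  <div class="item"><a class="name">Rice Cooker</a><span class="price">$89.50</span></div>
</div>
"""

def extract_items(markup):
    """Loop over every section with the same layout and extract its text fields."""
    root = ET.fromstring(markup.strip())
    rows = []
    # Each <div class="item"> plays the role of one entry in the loop list.
    for section in root.findall(".//div[@class='item']"):
        rows.append({
            # "Extract text": take the text content of the chosen element.
            "Product Description": section.findtext("a[@class='name']", default="").strip(),
            "Price": section.findtext("span[@class='price']", default="").strip(),
        })
    return rows

items = extract_items(SAMPLE_PAGE)
for row in items:
    print(row)
```

The dictionary keys correspond to the renamed data fields; a real page would need selectors that match its actual markup.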

 

 

 

Step 5. Set up pagination

To extract data from multiple pages, we need to configure pagination; that is, we tell Octoparse to scrape from the first page through the last page.

  • Click on "»" located to the right of page numbers
  • Choose "Loop Click Next Page"
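
Conceptually, "Loop Click Next Page" keeps extracting and moving to the next page until there is no next page left. The sketch below illustrates that loop in Python under simplifying assumptions: the PAGES dictionary is a hypothetical in-memory stand-in for the website, and crawl_all_pages and max_pages are names invented for this example.

```python
# Hypothetical in-memory "site": each page lists its items and the
# number of the next page (None on the last page).
PAGES = {
    1: {"items": ["item A", "item B"], "next": 2},
    2: {"items": ["item C"], "next": 3},
    3: {"items": ["item D", "item E"], "next": None},
}

def crawl_all_pages(pages, start=1, max_pages=100):
    """Collect items page by page until no next page exists."""
    collected = []
    current = start
    visited = 0
    # max_pages is a safety cap so a broken "next" link cannot loop forever.
    while current is not None and visited < max_pages:
        page = pages[current]
        collected.extend(page["items"])  # extract from the current page
        current = page["next"]           # "click" the next-page link
        visited += 1
    return collected

all_items = crawl_all_pages(PAGES)
print(all_items)
```

The cap on the number of pages is a design choice for the sketch, not an Octoparse setting.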

 

 

Step 6. Modify XPath to locate next page

In some cases, pagination does not work correctly because the auto-generated XPath is not accurate. We then need to work out the proper XPath manually and modify the setting for "Click to paginate".

  • First, make sure you are on the first page, so that the extraction can proceed through all the other pages.
  • Expand the "Advanced options" pane under "Click to paginate" and paste the following XPath into the “Single Element” text box: .//*[@id='contents']/div/div[2]/div[4]/ul/li/a[contains(text(),"»")]
  • Click "Save"
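
To see what the XPath above selects, here is a rough sketch using Python's standard library. Two assumptions apply: the sample markup is hypothetical and only mimics the nesting the XPath expects, and xml.etree.ElementTree supports only a subset of XPath, so the contains(text(),"»") condition is applied in Python instead of inside the path. The function name find_next_link is also invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup that mimics the nesting the XPath expects.
SAMPLE = """
<html><body>
<div id="contents">
  <div>
    <div>header</div>
    <div>
      <div>filters</div>
      <div>sort</div>
      <div>results</div>
      <div>
        <ul>
          <li><a href="?p=1">1</a></li>
          <li><a href="?p=2">2</a></li>
          <li><a href="?p=2">»</a></li>
        </ul>
      </div>
    </div>
  </div>
</div>
</body></html>
"""

def find_next_link(markup):
    """Return the pagination link whose text contains "»", or None."""
    root = ET.fromstring(markup.strip())
    # Structural part of the XPath, which ElementTree can evaluate directly.
    links = root.findall(".//*[@id='contents']/div/div[2]/div[4]/ul/li/a")
    # Stand-in for contains(text(),"»"), done in Python.
    for link in links:
        if "»" in (link.text or ""):
            return link
    return None

next_link = find_next_link(SAMPLE)
print(next_link.get("href") if next_link is not None else "no next page")
```

Filtering on the "»" text rather than on a position in the list is what keeps the selection stable when the number of page links changes from page to page.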

 

 

Step 7. Start running your task

  • After saving your extraction configuration, click "Next"
  • Select "Local Extraction"
  • Click "OK" to run the task on your computer

Octoparse will automatically extract all the data selected. Check the "Data Extracted" pane for the extraction progress.

 

 

Step 8. Export data

  • Click the "Export" button to export the extracted data to an Excel file, a database, or another format, and save the file to your computer
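
As a point of comparison, the sketch below shows one way extracted rows could be written out as CSV, a format that Excel can open. It is only an illustration and does not reflect Octoparse's export implementation; the sample rows and the function rows_to_csv are hypothetical.

```python
import csv
import io

# Hypothetical rows, shaped like the data fields defined in Step 4.
rows = [
    {"Product Description": "Green Tea", "Price": "$12.00"},
    {"Product Description": "Rice Cooker", "Price": "$89.50"},
]

def rows_to_csv(rows):
    """Serialize a list of dictionaries to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = rows_to_csv(rows)
print(csv_text)
```

Writing to an in-memory buffer keeps the example free of side effects; to save a file, the same writer can be pointed at a file opened with newline="".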

 

Done!

 

Now, you should be able to crawl Rakuten Global Market on your own. Get started with your own crawling task or download this Example to learn more.

 

To learn more about how to scrape from other high profile websites:

Web Scraping Case Study | Scraping Data from Yelp

Scrape Article Information from Google Scholar

Extract Facebook Data

How to Extract Information from Yellow Page Websites

 

Or learn more about what you can do with these powerful features:

Modify XPath Manually in Octoparse

Web Scraping - Modify X Path For "Load More" Button with Octoparse

Extracting Stock Prices using Regular expression (Example: Finance.Yahoo.com)

 

Author: The Octoparse Team

Download Octoparse Today

For more information about Octoparse, please click here.

 
