Web Scraping Case Study | Crawling data from Rakuten Global Market
Wednesday, May 10, 2017 12:46 PM
In this tutorial, we will walk through the detailed steps to crawl data from a retail website, Rakuten Global Market (global.rakuten.com).
List of features covered
- Set up pagination
- Build a loop list
- Modify XPath
Now, let's get started!
Step 1. Set up basic information
- Click "Quick Start"
- Create a new task in the Advanced Mode
- Complete the basic information
Step 2. Navigate to the target website
- Enter the target URL in the built-in browser (URL of the example: http://global.rakuten.com/en/category/100040/?p=1&l-id=rgm-top-en-nav-electronics-laptop)
- Click the "Go" icon to open the webpage
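The example URL carries extra information in its query string. The short Python sketch below is separate from Octoparse and uses only the standard library to split the URL into its parts. Reading `p` as the page number and `l-id` as a navigation label are assumptions based on their values, not something the site documents.

```python
from urllib.parse import urlsplit, parse_qs

# Example URL from Step 2 of this tutorial.
URL = "http://global.rakuten.com/en/category/100040/?p=1&l-id=rgm-top-en-nav-electronics-laptop"

parts = urlsplit(URL)
query = parse_qs(parts.query)  # maps each parameter name to a list of values

print(parts.netloc)   # host name
print(parts.path)     # category path
print(query["p"])     # assumed to be the page number
print(query["l-id"])  # assumed to be a navigation label
```

Knowing which part of the URL changes between pages can help when checking later that pagination really moved to a new page.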
Step 3. Create a list of items
Move your cursor over the products with a similar layout. These are the sections you will extract data from.
- Click anywhere on the first section of the web page
- When prompted, click “Create a list of items” (sections with similar layout)
- Click “Add current item to the list”
Now that the first item has been added to the list, we need to add the remaining items.
- Click “Continue to edit the list”
- Click a second section with similar layout
- Click “Add current item to the list” again
Now all the sections with a similar layout have been added to the list.
- Click “Finish Creating List”
- Click “loop”. This tells Octoparse to click on each section in the list and extract the selected data
Note: If the selection was not identified properly in the first place, click “Expand the selection area” until the outlined box includes all the content you want to crawl.
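To picture what a list of items plus a loop amounts to, here is a minimal Python sketch. It does not show how Octoparse works internally. The markup is hypothetical and much simpler than the real Rakuten page, and the class name `item` is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: NOT the real Rakuten page structure.
SAMPLE = """
<div id="results">
  <div class="item"><a href="/p/1">Laptop A</a><span class="price">$500</span></div>
  <div class="item"><a href="/p/2">Laptop B</a><span class="price">$650</span></div>
  <div class="item"><a href="/p/3">Laptop C</a><span class="price">$720</span></div>
</div>
"""

root = ET.fromstring(SAMPLE)

# "Create a list of items": collect every section that shares the same layout.
items = root.findall("./div[@class='item']")

# "Loop": visit each section in turn and read the same field from it.
names = [item.find("a").text for item in items]
print(names)
```

Because every section has the same layout, one rule for reading a field is enough for the whole list.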
Step 4. Select the data to be extracted and rename data fields
- Click the data field “Product Description”
- Select “Extract text”
- Follow the same steps to extract the other data fields
- Rename any field names if necessary
- Click "Save"
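Extracting text and renaming fields can be sketched in Python as follows. This is only an illustration under stated assumptions: the markup is hypothetical, the placeholder names `Field1` and `Field2` stand in for whatever default names the tool assigns, and `extract_text` is a helper invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for one product section; not the real page structure.
ITEM = '<div class="item"><a href="/p/1">  Laptop A </a><span class="price">$500</span></div>'

# Hypothetical mapping from placeholder field names to readable ones.
RENAME = {"Field1": "Product Description", "Field2": "Price"}

def extract_text(element):
    """Return all text inside an element, with surrounding whitespace removed."""
    return "".join(element.itertext()).strip()

item = ET.fromstring(ITEM)

# Extract the text of each selected element under a placeholder name.
raw = {
    "Field1": extract_text(item.find("a")),
    "Field2": extract_text(item.find("span")),
}

# Rename the fields; names without a mapping are kept unchanged.
record = {RENAME.get(name, name): value for name, value in raw.items()}
print(record)
```

Readable field names pay off at export time, since they become the column headers of the output file.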
Step 5. Set up Pagination
To extract data from multiple pages, we need to configure pagination. This tells Octoparse to scrape from the first page through the last page.
- Click on "»" located to the right of page numbers
- Choose "Loop Click Next Page"
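The idea behind a next-page loop can be sketched in a few lines of Python. This is a simplified model and does not reproduce Octoparse's implementation. The `PAGES` dictionary is a hypothetical stand-in for the website, and `max_pages` is a safety limit added for the example.

```python
# Hypothetical in-memory "site": each page lists its items and the next page, if any.
PAGES = {
    1: {"items": ["Laptop A", "Laptop B"], "next": 2},
    2: {"items": ["Laptop C", "Laptop D"], "next": 3},
    3: {"items": ["Laptop E"], "next": None},  # last page: no next-page link
}

def crawl(pages, start=1, max_pages=100):
    """Collect items page by page until there is no next page."""
    collected = []
    page = start
    visited = 0
    while page is not None and visited < max_pages:
        collected.extend(pages[page]["items"])  # extract from the current page
        page = pages[page]["next"]              # follow the next-page link
        visited += 1
    return collected

all_items = crawl(PAGES)
print(all_items)
```

The loop ends when a page has no next-page link, which is why the next-page button must be located reliably on every page.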
Step 6. Modify XPath to locate next page
In some cases, pagination does not work correctly because the auto-generated XPath is not accurate. Hence, we'll need to manually figure out the proper XPath to use and modify the setting for "click to paginate".
- First, make sure you are on the first page so that extraction can proceed through all the following pages.
- Expand the "Advanced options" pane under "Click to paginate" and paste the following XPath into the “Single Element” text box: .//*[@id='contents']/div/div[2]/div[4]/ul/li/a[contains(text(),"»")]
- Click "Save"
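The reason the XPath ends with a text test on "»" can be seen in a small Python sketch. The markup here is hypothetical and much flatter than the real page, so the path is shorter than the one used in Step 6. Python's built-in ElementTree supports only a subset of XPath and has no contains(), so the sketch selects the candidate links with a path and then checks the text in Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical pagination markup; the real page nests this list deeper.
PAGER = """
<body>
  <div id="contents">
    <ul>
      <li><a href="?p=1">1</a></li>
      <li><a href="?p=2">2</a></li>
      <li><a href="?p=3">3</a></li>
      <li><a href="?p=2">»</a></li>
    </ul>
  </div>
</body>
"""

root = ET.fromstring(PAGER)

# The path alone matches every link in the pagination list.
links = root.findall(".//*[@id='contents']/ul/li/a")

# Checking the link text narrows the match to the next-page link only.
next_links = [a for a in links if "»" in (a.text or "")]

print(len(links))
print(next_links[0].get("href"))
```

Without the text check, the numbered page links match the same path, and the loop could click the wrong link.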
Step 7. Start running your task
- After saving your extraction configuration, click "Next"
- Select "Local Extraction"
- Click "OK" to run the task on your computer
Octoparse will automatically extract all the data selected. Check the "Data Extracted" pane for the extraction progress.
Step 8. Export data
- Click the "Export" button to export the extracted data to an Excel file, a database, or another format, and save the file to your computer
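If you later want to process the extracted records in a script, writing them out as CSV might look like the Python sketch below. The rows are hypothetical sample data, and `to_csv` is a helper written for this example. It is not part of Octoparse.

```python
import csv
import io

# Hypothetical extracted rows; the real values come from your own task.
rows = [
    {"Product Description": "Laptop A", "Price": "$500"},
    {"Product Description": "Laptop B", "Price": "$650"},
]

def to_csv(records):
    """Serialize a non-empty list of dicts to CSV text, using the first record's keys as headers."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

csv_text = to_csv(rows)
print(csv_text)
```

To write a file instead of a string, open it with `open(path, "w", newline="", encoding="utf-8")` and pass the file object to `csv.DictWriter` in place of the buffer.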
Done!
Now, you should be able to crawl Rakuten Global Market on your own. Get started with your own crawling task or download this Example to learn more.
To learn more about how to scrape other high-profile websites:
Web Scraping Case Study | Scraping Data from Yelp
Scrape Article Information from Google Scholar
How to Extract Information from Yellow Page Websites
Or learn more about what you can do with these powerful features:
Modify XPath Manually in Octoparse
Web Scraping - Modify X Path For "Load More" Button with Octoparse
Extracting Stock Prices using Regular expression (Example: Finance.Yahoo.com)
Author: The Octoparse Team