Scrape product data from Tokopedia

Friday, November 16, 2018

In this tutorial, we will show you how to collect product information from Tokopedia (an Indonesian e-commerce site) with Octoparse.

We will open the detail page of each USB product and scrape details including the product title, price, rating, and image URL.

To follow along, you can use the URL from this tutorial:

https://www.tokopedia.com/search?st=product&q=usb
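The search URL above is an ordinary query string, so the same task can be pointed at another keyword by changing the q parameter. Here is a minimal Python sketch, assuming the st and q parameters behave the way they appear in the URL above; the helper name build_search_url is hypothetical and is not part of Octoparse.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "https://www.tokopedia.com/search"

def build_search_url(keyword: str) -> str:
    """Return a product-search URL for the given keyword."""
    # st=product and q=<keyword> mirror the parameters in the tutorial URL.
    return f"{BASE}?{urlencode({'st': 'product', 'q': keyword})}"

url = build_search_url("usb")
print(url)
# Split the query string back into its parameters to check what was built.
print(parse_qs(urlsplit(url).query))
```

The resulting string can be pasted into the "Extraction URL" box in place of the one above.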

 

This tutorial will also cover:

         · Modifying XPath to locate the desired price data accurately

 

Here are the main steps in this tutorial [Download demo task file here]

1) "Go To Web Page" - to open the target web page

2) Create a pagination loop - to scrape all data from multiple pages

3) Build a "Loop Item" - to click into each item on each list page

4) Extract data - to select the data for extraction

5) Customize data field by modifying XPath - to improve the accuracy of a certain data field (Optional)

6) Run extraction - to run your task and get data

1) "Go To Web Page" - to open the target web page

 · Create the task with "Advanced Mode".

 · Paste the URL into the "Extraction URL" box and click "Save URL" to move on.

 

 

 

 

 

 

 

2) Create a pagination loop - to scrape all data from multiple pages

 · Scroll down and click the "Next Page" button on the webpage

 · Click "Loop click next page" on "Action Tips"

Tokopedia uses AJAX for its pagination button, so the next page of results loads without a full page reload. Therefore, we need to set up AJAX Load in the "Click to paginate" step.

 · Uncheck "Auto retry when no response"

 · Check "Load the page with AJAX"

 · Set up "AJAX Timeout"
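To make the role of a timeout concrete: when a page updates through AJAX, a scraper cannot rely on a page-load event, so it has to wait for the content to change and give up after a set time. The Python sketch below illustrates that general idea only. It is not Octoparse's implementation, and wait_for_change, read_content, and the timing values are all hypothetical.

```python
import time
from typing import Callable

def wait_for_change(read_content: Callable[[], str], before: str,
                    timeout: float = 5.0, poll: float = 0.1) -> bool:
    """Poll until the content differs from `before` or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if read_content() != before:
            return True   # new results arrived
        time.sleep(poll)
    return False          # timed out; no visible change after the click

# Simulated page whose content switches to "page 2" on the third read.
reads = ["page 1", "page 1", "page 2"]

def fake_read() -> str:
    return reads.pop(0) if len(reads) > 1 else reads[0]

print(wait_for_change(fake_read, before="page 1", timeout=2.0, poll=0.01))
```

A timeout that is too short risks moving on before the new results appear, while one that is too long slows down every page turn.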

 

 

 

Tips!

If you want to learn more about AJAX, here are related tutorials you might need:

· Deal with AJAX 

· Why does Octoparse stop after clicking "Next"?

 

 

 

 

 

3) Build a "Loop Item" - to click into each item on each list page

We are now on the second page. When creating a "Loop Item", we should always start with the first item on the first page, so we need to go back to the first page.

 · Click "Go To Web Page" in the workflow.

 · Select the pagination loop in the workflow

Doing this tells Octoparse the execution order, so that the Loop Item is generated at the appropriate position in the workflow.

 

The list of items on a page sometimes includes several "Ads" items. To exclude these promoted products in this case, we can start building the Loop Item from the 3rd row on this page.

 · Click the title of the first item on the 3rd row

 · Click "Select All" on "Action Tips"

 · Select "Loop click each element"
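Conceptually, the workflow built so far is a pagination loop with a Loop Item nested inside it. The Python sketch below mirrors that structure on made-up in-memory data. The page contents, the row width, and the function name are illustrative assumptions, and it also assumes every page has the same layout, with slicing off the first two rows standing in for starting the selection at the 3rd row.

```python
from typing import Iterator, List

ROW_WIDTH = 2   # hypothetical number of products per row
AD_ROWS = 2     # the tutorial skips the first two rows of "Ads" items

def iter_products(pages: List[List[str]]) -> Iterator[str]:
    """Yield non-ad product titles page by page."""
    for page in pages:                            # pagination loop
        for title in page[AD_ROWS * ROW_WIDTH:]:  # Loop Item
            yield title

# Made-up listing pages: four ad slots first, then the regular products.
pages = [
    ["ad1", "ad2", "ad3", "ad4", "usb A", "usb B"],
    ["ad5", "ad6", "ad7", "ad8", "usb C"],
]
print(list(iter_products(pages)))
```

The inner loop is where the data extraction in the next step happens, once per product.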

 

 

Tips!

In this case, we exclude the "Ads" items by skipping the first two rows. However, when "Ads" items appear within the product list, there is another way to exclude them:

How to exclude "Ads" items when creating a list?

4) Extract data - to select the data for extraction

 · Click the information you need on the page

 · Select "Extract data" in the "Action Tips"

 · Rename the fields by selecting from the pre-defined list or typing your own names

5) Customize data field by modifying XPath - to improve the accuracy of a certain data field (Optional)

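In this case, the price element is not always in the same place on every detail page. To avoid missing data caused by this inconsistent position, we need to modify the XPath in Octoparse so that the price element is located precisely on each page.

The revised XPath of the price field is //span[text()='Rp']/following-sibling::span. It finds the span whose text is "Rp" and selects the span elements that follow it under the same parent, so the price is located relative to the currency label rather than by its position on the page.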

 · Click "Customize data field"

 · Select "Customize XPath"

 · Paste the revised XPath into the Matching XPath textbox

 · Click "OK" to save the result.
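To see why anchoring on the "Rp" label is more robust than relying on a fixed position, here is a small Python sketch. The standard library's ElementTree does not support the following-sibling axis, so the helper below roughly re-implements the same selection by hand. The two snippets and the name price_after_rp are made up for illustration and are not Tokopedia's real markup.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

def price_after_rp(root: ET.Element) -> List[Optional[str]]:
    """Roughly emulate //span[text()='Rp']/following-sibling::span."""
    prices: List[Optional[str]] = []
    for parent in root.iter():
        children = list(parent)
        for i, child in enumerate(children):
            if child.tag == "span" and child.text == "Rp":
                # Collect every span after the "Rp" label under the same parent.
                prices += [s.text for s in children[i + 1:] if s.tag == "span"]
    return prices

# Two hypothetical detail pages with the price at different depths and positions.
page_a = ET.fromstring(
    "<div><h1>USB</h1><p><span>Rp</span><span>25.000</span></p></div>")
page_b = ET.fromstring(
    "<div><p>Promo</p><h1>USB</h1>"
    "<div><b>New</b><span>Rp</span><span>31.500</span></div></div>")

print(price_after_rp(page_a))
print(price_after_rp(page_b))
```

Both layouts yield the price because the lookup depends only on the "Rp" label and its sibling, not on where that pair sits in the page.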

 

 

 

Tips!

To improve the accuracy of a certain data field, modifying XPath in Octoparse is highly recommended. Here are some related tutorials you might need:

 · How to associate data with nearby text?

 · Data fetched to the incorrect data fields

 · Locate elements with XPath 

 

 

 

 

 

 

6) Run extraction - to run your task and get data

 · Click "Start Extraction"

 · Select "Local Extraction" to run the task on your computer

 

 

Here is the sample output.

 

 

 
