Scrape product data from Tokopedia
Monday, January 28, 2019
In this tutorial, we will show you how to collect product information from Tokopedia (an Indonesian e-commerce site) with Octoparse.
We will open the detail page of each USB product and scrape details including the product title, price, rating, and image URL.
To follow along, you may want to use the URL in this tutorial:
This tutorial will also cover:
· Modifying the XPath to locate the desired price data accurately
Here are the main steps in this tutorial: [Download demo task file here]
1) "Go To Web Page" - to open the target web page
· Create the task with "Advanced Mode".
· Paste the URL into the "Extraction URL" box and click "Save URL" to move on.
2) Create a pagination loop - to scrape all data from multiple pages
· Scroll down and click the "Next Page" button on the webpage
· Click "Loop click next page" on "Action Tips"
Tokopedia's pagination button loads the next page with AJAX instead of opening a new URL. Therefore, we need to set up AJAX Load in the "Click to paginate" step.
· Uncheck "Auto retry when no response"
· Check "Load the page with AJAX"
· Set up "AJAX Timeout"
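To make the idea behind the AJAX settings concrete, here is a minimal Python sketch of a pagination loop that waits for in-place content to change after each click. It is only an illustration of the general technique, not how Octoparse works internally. `FakeListing`, `wait_for_change`, `scrape_all_pages`, and the sample item names are all hypothetical stand-ins invented for this example.

```python
import time
from typing import Callable, List


class FakeListing:
    """Hypothetical stand-in for a listing whose "Next Page" button
    swaps the items in place (AJAX) instead of loading a new URL."""

    def __init__(self, pages: List[List[str]]) -> None:
        self._pages = pages
        self._index = 0

    def items(self) -> List[str]:
        return list(self._pages[self._index])

    def has_next(self) -> bool:
        return self._index < len(self._pages) - 1

    def click_next(self) -> None:
        self._index += 1


def wait_for_change(read: Callable[[], List[str]], before: List[str],
                    timeout: float, poll: float = 0.01) -> bool:
    """Poll until the content differs from `before` or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if read() != before:
            return True
        time.sleep(poll)
    return read() != before


def scrape_all_pages(listing: FakeListing, ajax_timeout: float = 1.0) -> List[str]:
    """Collect the items on every page, clicking "next" until it runs out."""
    collected: List[str] = []
    while True:
        collected.extend(listing.items())
        if not listing.has_next():
            break
        before = listing.items()
        listing.click_next()
        # Stop if the content never refreshed within the timeout.
        if not wait_for_change(listing.items, before, ajax_timeout):
            break
    return collected


listing = FakeListing([["usb-a", "usb-b"], ["usb-c"], ["usb-d", "usb-e"]])
result = scrape_all_pages(listing)
print(result)
```

The timeout plays the same role as the "AJAX Timeout" setting: it bounds how long the loop waits for the new items before moving on.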
If you want to learn more about AJAX, here are related tutorials you might need:
3) Build a "Loop Item" - to click into each item in the list on every page
We are now on the second page. When creating a "Loop Item", we should always start with the first item on the first page, so we should go back to the first page.
· Click "Go To Web Page" in the workflow.
· Select the pagination loop in the workflow
By doing this, we can help Octoparse decide the execution order and generate the Loop Item at the appropriate position in the workflow.
When you build a list of items to scrape from a website, the list may sometimes include several "Ads" items. To exclude these promoted products in this case, we can start building the Loop Item from the 3rd row on this page.
· Click the title of the first item on the 3rd row
· Click "Select All" on "Action Tips"
· Select "Loop click each element"
In this case, we exclude the "Ads" items by skipping the first two rows. However, when "Ads" items appear within the product list itself, there is another way for you to exclude them.
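The row-skipping approach can be sketched in Python as follows. This is a minimal illustration using the standard library's `xml.etree.ElementTree`; the markup, the `row` and `title` class names, and the product titles are assumptions made up for the example and do not reflect Tokopedia's real page structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing markup: the first two rows hold promoted items.
LISTING = """
<div class="results">
  <div class="row"><a class="title">Sponsored USB hub</a><a class="title">Sponsored USB cable</a></div>
  <div class="row"><a class="title">Sponsored flash drive</a><a class="title">Sponsored USB fan</a></div>
  <div class="row"><a class="title">USB flash drive 32GB</a><a class="title">USB-C cable 1m</a></div>
  <div class="row"><a class="title">USB card reader</a><a class="title">USB desk lamp</a></div>
</div>
""".strip()


def titles_from_row(root, first_row=3):
    """Return item titles starting at the given 1-based row number."""
    rows = root.findall("./div[@class='row']")
    titles = []
    for row in rows[first_row - 1:]:
        for link in row.findall("./a[@class='title']"):
            titles.append(link.text)
    return titles


root = ET.fromstring(LISTING)
organic_titles = titles_from_row(root, first_row=3)
print(organic_titles)
```

Skipping by row position only works while the promoted items stay in the first two rows, which is why a different approach is needed when they are mixed into the list.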
4) Extract data - to select the data for extraction
· Click the information you need on the page
· Select "Extract data" in the "Action Tips"
· Rename the fields by selecting names from the pre-defined list or entering your own
5) Customize the data field by modifying the XPath - to improve the accuracy of a certain data field (Optional)
In this case, the price element is not always located in the same place on different detail pages. To avoid missing data caused by this irregular location, we need to modify the XPath in Octoparse so that the price element is detected precisely on each page.
The revised XPath of the price field is //span[text()='Rp']/following-sibling::span.
· Click "Customize data field"
· Select "Customize XPath"
· Paste the revised XPath into the Matching XPath textbox
· Click "OK" to save the result.
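To see why the revised XPath is robust to layout changes, the sketch below reproduces its logic in Python: find every `span` whose text is exactly `Rp`, then take the `span` elements that follow it under the same parent. The standard library's `xml.etree.ElementTree` does not support the `following-sibling` axis, so the siblings are walked by hand. `price_spans`, the two sample pages, and their prices are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET

# Two hypothetical detail pages with the price in different places.
PAGE_A = ("<div><h1>USB drive</h1>"
          "<div class='price'><span>Rp</span><span>45.000</span></div></div>")
PAGE_B = ("<div><div class='promo'><p>Sale</p>"
          "<div><span>Rp</span><span>120.000</span></div></div>"
          "<h1>USB hub</h1></div>")


def price_spans(root):
    """Emulate //span[text()='Rp']/following-sibling::span by hand."""
    matches = []
    for parent in root.iter():
        children = list(parent)
        for i, child in enumerate(children):
            # Anchor on the currency label, wherever it sits in the page.
            if child.tag == "span" and child.text == "Rp":
                for sibling in children[i + 1:]:
                    if sibling.tag == "span" and sibling not in matches:
                        matches.append(sibling)
    return matches


prices = [price_spans(ET.fromstring(page))[0].text for page in (PAGE_A, PAGE_B)]
print(prices)
```

Because the expression anchors on the `Rp` label rather than on a fixed position in the page, the same rule finds the price in both layouts.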
Modifying the XPath in Octoparse is highly recommended whenever you need to improve the accuracy of a data field. Here are some related tutorials you might need:
6) Run extraction - to run your task and get data
· Click "Start Extraction"
· Select "Local Extraction" to run the task on your computer
Here is the sample output.