Scrape Data from Buyma.com | Japanese Sites Case Study
Thursday, September 14, 2017 4:51 AM
Welcome to Octoparse case tutorials! Scraping shopping websites to collect product information has become very popular, so today we are going to learn how to scrape a Japanese shopping website: Buyma.com.
List features covered
Now let's get started!
Step 1. Set up basic information
- Click "Quick Start"
- Create a new task in the Advanced Mode
- Complete the basic information
Step 2. Navigate to the target website
- Enter the target URL in the built-in browser (example URL: https://www.buyma.com/r/-C2106O2/)
- Click the "Go" icon to open the webpage
Step 3. Set up pagination
Since we need to page through multiple result pages to extract as much data as possible, setting up pagination is important.
- Click the “次の40件へ” ("Next 40 items") link
- Choose “Loop click the element”
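Outside Octoparse, the "Loop click the element" action amounts to following the next-page link until it no longer appears. The sketch below illustrates that idea only; it is not how Octoparse is implemented. The `iterate_pages` function, the `fetch` callable and the `FAKE_SITE` data are hypothetical names, and the fake data stands in for the real site so the example runs without network access.

```python
from typing import Callable, Iterator, Optional


def iterate_pages(
    start_url: str,
    fetch: Callable[[str], dict],
    max_pages: int = 50,
) -> Iterator[dict]:
    """Yield pages one at a time, following each page's "next" link."""
    url: Optional[str] = start_url
    visited = set()
    # Stop when there is no next link, when a link repeats, or at the page cap.
    while url is not None and url not in visited and len(visited) < max_pages:
        visited.add(url)
        page = fetch(url)
        yield page
        url = page.get("next")  # None when the page has no next-page link


# Hypothetical stand-in for the site: three result pages chained by "next".
FAKE_SITE = {
    "page1": {"items": ["a", "b"], "next": "page2"},
    "page2": {"items": ["c", "d"], "next": "page3"},
    "page3": {"items": ["e"], "next": None},
}

all_items = [
    item
    for page in iterate_pages("page1", FAKE_SITE.__getitem__)
    for item in page["items"]
]
print(all_items)
```

The `max_pages` cap and the `visited` set are defensive choices for this sketch, so the loop cannot run forever if a next link points back to an earlier page.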
Step 4. Create a list of items
All the product blocks share a similar layout, so we can add them to a loop list and configure Octoparse to click each one.
- Click the name of the first product
- Click "Create a list of items"
- Click "Add current item to the list"
Now that the first item has been added to the list, we need to add the remaining items.
- Click "Continue to edit the list"
- Click the second product name
- Click "Add current item to the list" again
Now all the products have been added to the list.
- Click "Finish Creating List"
- Click "Loop", which tells Octoparse to go through the list and click each item in turn
Step 5. Modify the XPath to locate products precisely
After we click “Loop”, Octoparse automatically clicks the first item in the loop list. If we go back to the "Loop Item", we will find that the list also contains image links. So we need to modify the XPath so that it matches only the links of the product titles.
- Click the "Loop Item"
- Enter the correct XPath (.//*[@id='n_ResultList']/ul/li/div/div/a) in the "Variable list" box
- Click "Save"
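To see why the narrower XPath helps, here is a minimal sketch that runs the expression from this step against a small, made-up fragment. The markup in `SAMPLE` is an assumption about the page structure (image links one level above the title links), not a copy of the real Buyma page, and Python's standard `xml.etree.ElementTree` supports only a subset of XPath that happens to cover this expression.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified result list: each product has an image link
# and, one <div> deeper, a title link.
SAMPLE = """
<html><body>
<div id="n_ResultList">
  <ul>
    <li>
      <div>
        <a href="/item/1/"><img src="1.jpg" /></a>
        <div><a href="/item/1/">Sample bag</a></div>
      </div>
    </li>
    <li>
      <div>
        <a href="/item/2/"><img src="2.jpg" /></a>
        <div><a href="/item/2/">Sample wallet</a></div>
      </div>
    </li>
  </ul>
</div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# A broad expression picks up every link, image links included.
all_links = root.findall(".//*[@id='n_ResultList']//a")

# The expression from the tutorial matches only the title links.
title_links = root.findall(".//*[@id='n_ResultList']/ul/li/div/div/a")
titles = [a.text for a in title_links]

print(len(all_links), titles)
```

On this fragment the broad expression returns four links while the tutorial's expression returns only the two title links, which is the effect we want in the loop list.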
Step 6. Select the data to be extracted and rename data fields
We need to click “Click Item” to load the product detail page to extract information. Note that the extraction action we will be setting up for this product is going to apply to the rest of the list.
- Click "Click Item"
- Click the product name or any other information you want
- Select "Extract text"
- Follow the same steps to extract other data
- Rename any field if necessary
- Click "Save"
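Conceptually, "Extract text" takes the visible text of the selected element, and renaming a field just gives that value a descriptive key. The sketch below shows this on an invented detail-page fragment. The markup in `DETAIL`, the helper `extract_text` and the field names `product_name` and `price` are all hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical product detail fragment; the real page layout will differ.
DETAIL = (
    '<div id="detail">'
    "<h1>Sample <b>leather</b> bag</h1>"
    '<p class="price">12,000 JPY</p>'
    "</div>"
)


def extract_text(element: ET.Element) -> str:
    """Return the element's text, including nested tags, with whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


root = ET.fromstring(DETAIL)

# Each key plays the role of a renamed data field.
record = {
    "product_name": extract_text(root.find("h1")),
    "price": extract_text(root.find("p[@class='price']")),
}
print(record)
```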
Step 7. Start running your task
Now we are done configuring the task and it's time to run the task to get the data we want.
- Click "Next"
- Click "Next"
- Click "Local Extraction"
Octoparse offers Local Extraction and Cloud Extraction (premium plan). With Local Extraction, the task runs on your own machine. With Cloud Extraction, the task runs on the Octoparse Cloud Platform, so you can start it, turn off your desktop or laptop, and the data will still be extracted and saved to the cloud. Scheduled extraction, IP rotation and API access are also supported in the Cloud. Find out more about Octoparse Cloud here.
Step 8. Check and export the data
After the extraction completes, we can review the extracted data or click the "Export" button to export the results to an Excel file, a database or another format and save the file to your computer.
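If you post-process exported data yourself, the export step comes down to writing one row per extracted record. The following is a minimal sketch using Python's standard `csv` module, with made-up records and a hypothetical helper `to_csv`. It does not reproduce Octoparse's own export format.

```python
import csv
import io
from typing import Dict, List


def to_csv(records: List[Dict[str, str]], fieldnames: List[str]) -> str:
    """Serialize records to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical extracted records; the field names are illustrative only.
records = [
    {"product_name": "Sample bag", "price": "12,000 JPY"},
    {"product_name": "Sample wallet", "price": "8,500 JPY"},
]

csv_text = to_csv(records, ["product_name", "price"])
print(csv_text)
```

Values that contain commas, such as the prices here, are quoted automatically by the `csv` module, so the file opens cleanly in Excel.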
To learn more about how to crawl data from a website, you can refer to these tutorials:
Web Crawling Case Study | Crawling Data from Booking.com
Web Scraping Case Study | Crawling data from Rakuten Global Market