Scrape Data from Yahoo JP news | Japanese Sites Case Study

Friday, September 08, 2017 7:34 AM

Welcome to Octoparse case tutorials! Today, let’s see how to extract data from news.yahoo.co.jp.

 

List features covered 

 

Now let's get started!

 

Step 1. Set up basic information

  • Click "Quick Start"
  • Create a new task in the "Advanced Mode"
  • Complete the basic information

 

Step 2. Navigate to the target website

  • Enter the target URL in the built-in browser (the example URL: https://news.yahoo.co.jp/zasshi?c=dom)
  • Click the "Go" icon to open the webpage

 

 

Step 3. Set up pagination  

To extract as much data as possible, we need Octoparse to page through the results. We do this by setting up a pagination action on the "次へ" (Next) button.

  • Click the "次へ" button
  • Choose "Loop click the element"
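
For readers who want to see what "Loop click the element" amounts to outside the Octoparse interface, here is a minimal Python sketch of the same idea: find the link labelled "次へ", follow it, and repeat until no such link remains. It is not how Octoparse is implemented. The helper names (find_next_url, walk_pages) and the fetch callback are hypothetical, and the sketch assumes the "次へ" control is a plain <a> element with an href attribute.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Records the href of the first <a> whose visible text contains the label."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.next_href = None
        self._current_href = None  # href of the <a> we are currently inside
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._current_href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            if self.next_href is None and self.label in "".join(self._text):
                self.next_href = self._current_href
            self._current_href = None


def find_next_url(html, base_url, label="次へ"):
    """Return the absolute URL behind the pagination link, or None on the last page."""
    finder = NextLinkFinder(label)
    finder.feed(html)
    if finder.next_href is None:
        return None
    return urljoin(base_url, finder.next_href)


def walk_pages(start_url, fetch, max_pages=5):
    """Yield (url, html) per page, following the pagination link until it runs out.

    `fetch` is any callable that maps a URL to its HTML (hypothetical; supply your own).
    """
    url = start_url
    seen = set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        html = fetch(url)
        yield url, html
        url = find_next_url(html, url)
```

The max_pages cap and the seen set are there so the loop stops even if a page links back to itself.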

 

 

Step 4. Create a list of items

All the news titles share a similar layout, so we can build a list of those titles and have Octoparse click them in order.

  • Click the first news title
  • Click "Create a list of items" 
  • Click "Add current item to the list"

 

Now that the first item has been added to the list, we need to add the remaining titles.

  • Click "Continue to edit the list"
  • Click the second news title
  • Click "Add current item to the list" again

All the titles will be automatically added to the list.

  • Click "Finish Creating List"
  • Click "Loop", which tells Octoparse to go through the list and click each title
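
Conceptually, the list built above is a set of links that share the same markup. A rough Python sketch of that idea follows, assuming each title is an <a> element carrying a common CSS class. The class name "news-title" and the function list_title_links are made up for illustration; the real page's markup will differ and should be checked in the browser.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TitleLinkCollector(HTMLParser):
    """Collects (title, href) for every <a> that carries the given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.items = []
        self._open = None  # the matching <a> currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.css_class in classes and attrs.get("href"):
            self._open = {"href": attrs["href"], "text": []}

    def handle_data(self, data):
        if self._open is not None:
            self._open["text"].append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            title = "".join(self._open["text"]).strip()
            self.items.append((title, self._open["href"]))
            self._open = None


def list_title_links(html, base_url, css_class="news-title"):
    """Return [(title, absolute_url), ...] in page order.

    The default class name is a placeholder, not the site's real markup.
    """
    collector = TitleLinkCollector(css_class)
    collector.feed(html)
    return [(title, urljoin(base_url, href)) for title, href in collector.items]
```

Looping over the returned list and fetching each URL plays the role of the "Loop" action: every title is visited in turn.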

 

 

Step 5. Select the data to be extracted and rename data fields

After we click "Loop", Octoparse opens the first news article automatically, and we can choose what to extract from it. Note that the extraction action we set up for this article will be applied to every other article in the list. Say we want to capture the news title and the main body.

  • Click the title
  • Select "Extract text"
  • Follow the same steps to extract the other data
  • Rename any field if necessary
  • Click "Save"
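
The "Extract text" action and the field names can be pictured as a small set of rules, one per field, each pointing at an element on the article page. The sketch below is one way to express that in Python. It assumes the title is in an <h1> and the body in a <div> with the class "article-body"; both are assumptions for illustration, as are the names FieldExtractor and extract_fields. It also assumes reasonably well-formed HTML, since unclosed tags would throw off the nesting count.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collects the text of the first element matching each (tag, class) rule."""

    # Tags that never have a closing tag, so they must not affect nesting depth.
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self, rules):
        super().__init__()
        self.rules = rules    # field name -> (tag, CSS class or None)
        self.fields = {}      # field name -> extracted text
        self._active = None   # field currently being captured
        self._depth = 0       # nesting depth inside the captured element
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if self._active is not None:
            if tag not in self.VOID:
                self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for name, (want_tag, want_class) in self.rules.items():
            if name in self.fields:
                continue  # keep only the first match per field
            if tag == want_tag and (want_class is None or want_class in classes):
                self._active, self._depth, self._parts = name, 1, []
                break

    def handle_endtag(self, tag):
        if self._active is None or tag in self.VOID:
            return
        self._depth -= 1
        if self._depth == 0:
            # Join the text pieces and collapse runs of whitespace.
            self.fields[self._active] = " ".join(" ".join(self._parts).split())
            self._active = None

    def handle_data(self, data):
        if self._active is not None:
            self._parts.append(data)


def extract_fields(html, rules):
    """Return {field name: text} for one article page."""
    extractor = FieldExtractor(rules)
    extractor.feed(html)
    return extractor.fields


# Hypothetical rules: the dictionary keys double as the (renamed) field names.
ARTICLE_RULES = {"title": ("h1", None), "body": ("div", "article-body")}
```

Renaming a field then just means changing a key in the rules dictionary, and the same rules are reused for every article in the list.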

 

Step 6. Start running your task 

Now we are done configuring the task and it's time to run the task to get the data we want.

  • Click "Next" twice
  • Click "Local Extraction"

There are two ways to run a task: Local Extraction and Cloud Extraction (premium plan). With Local Extraction, the task runs on your own machine. With Cloud Extraction, the task runs on the Octoparse Cloud Platform, so you can start it, turn off your desktop or laptop, and the data will still be extracted and saved to the cloud. Scheduled extraction, IP rotation and API access are also available with Cloud Extraction.

 

Step 7. Check and export the data

Once the extraction has finished, we can review the extracted data, or click the "Export" button to export the results to an Excel file, a database or another format and save the file to the computer.
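
If the extracted rows are handled in a script rather than through the "Export" button, a CSV file that Excel can open is a simple target format. The following is a minimal sketch using Python's standard csv module. The function name export_csv and the field names "title" and "body" are assumptions carried over from the example fields in Step 5, not something Octoparse defines.

```python
import csv


def export_csv(rows, path, fieldnames=("title", "body")):
    """Write a list of dicts to a CSV file.

    The "utf-8-sig" encoding adds a byte order mark, which helps Excel
    detect UTF-8 so Japanese text is displayed correctly.
    """
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Each dictionary in rows becomes one line of the file, with the field names written as the header row.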

 

Done!

To learn more about how to crawl data from a website, you can refer to these tutorials:

Scraping Stock Data on Yahoo Finance

Scraping Articles from Yahoo! Tech

Web Scraping Tutorial: Branch Judgement

 

Author: The Octoparse Team
