Web Scraping Tutorial: Branch Judgement

Wednesday, March 08, 2017 2:45 AM

In this tutorial, we will show you how to judge whether a specific image is present on a particular web page. To execute the branch judgement, we need to edit the XPath of certain elements. You can run the extraction on your local machine or, for faster speed, on the cloud platform.

 

The website URL we will use is https://www.yahoo.com/news/.

You can download the task (OTD file) directly and begin collecting the data, or you can follow the steps below to build a scraping task that extracts the latest tech news articles from Yahoo News. (Download the extraction task for this tutorial HERE in case you need it.)

 

Part 1 - Make a scraping task in Octoparse

Step 1. Set up basic information.

Click "Quick Start" ➜ Choose "New Task (Advanced Mode)" ➜ Complete basic information ➜ Click "Next".

 

Step 2.

Enter the target URL in the built-in browser. ➜ Click the "Go" icon to open the webpage.

(URL of the example: https://www.yahoo.com/news/)


 

Step 3.

Right click the first article to start a list of articles with a similar layout. ➜ Click "Create a list of items". ➜ Click "Add current item to the list".

The first article is now added to the list. ➜ Click "Continue to edit the list".

Right click the second article. ➜ Click "Add current item to the list" again (now we get all the articles with a similar layout). ➜ Click "Finish Creating List". ➜ Click "loop" to process the list for extracting the content of the subtitles.

 

Step 4.

Right click the first paragraph of the first article in the "Loop Item" to start a list of paragraphs with a similar layout. ➜ Click "Create a list of items". ➜ Click "Add current item to the list".

The first paragraph is now added to the list. ➜ Click "Continue to edit the list".

Right click the second paragraph of the first article. ➜ Click "Add current item to the list" again (now we get all the paragraphs with a similar layout). ➜ Click "Finish Creating List". ➜ Click "loop" to process the list for extracting the content of the paragraphs.

 

Note: If necessary, right click the content rather than left clicking it, so that you do not trigger its hyperlink.

 

Step 5.

Modify the XPath of the Variable list to include the image in this article.

Go to "Loop Mode" ➜ "Variable list" ➜ Change the XPath of the Variable list to: //article[@data-type='story']/*/* ➜ Click "Save".

Note: If the page keeps loading after its content has fully loaded, you can click the multiplication sign (×) to stop it from loading.
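To see what the modified XPath selects, here is a minimal Python sketch using only the standard library. The markup below is a hypothetical, heavily simplified stand-in for the Yahoo News page, not its real structure, and ElementTree supports only a subset of XPath, so the expression is written with a leading "." for a document-wide search.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup (an assumption for illustration only).
PAGE = """
<html><body>
  <article data-type="story">
    <div>
      <img src="photo.jpg" />
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </div>
  </article>
  <article data-type="ad">
    <div><p>Sponsored.</p></div>
  </article>
</body></html>
"""

root = ET.fromstring(PAGE)

# Equivalent of //article[@data-type='story']/*/* :
# every grandchild of a "story" article, whatever its tag.
loop_items = root.findall(".//article[@data-type='story']/*/*")
tags = [item.tag for item in loop_items]
print(tags)  # prints ['img', 'p', 'p']
```

Because the expression ends in wildcards (/*/*) rather than a paragraph tag, the loop picks up the image alongside the paragraphs, and the non-story article is ignored.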

 

Step 6. 

Right click the image of the article in the "Loop Item", or right click the image in the web browser. ➜ Select "Extract Outer Html, the page source code, text with format and images".

The content will be selected in Data Fields. ➜ Then click "Save".
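As a rough illustration of the difference between outer HTML and plain text, the sketch below serializes a hypothetical element both ways with Python's standard library. The markup and variable names are assumptions for this example; they are not taken from Octoparse or from the Yahoo News page.

```python
import xml.etree.ElementTree as ET

# Hypothetical element containing an image and a caption.
element = ET.fromstring(
    '<figure><img src="photo.jpg" alt="Launch" />'
    '<figcaption>Launch day</figcaption></figure>'
)

# Outer HTML: the element's own tag, its attributes, and everything inside it.
outer_html = ET.tostring(element, encoding="unicode")

# Plain text: only the text nodes; the image and its src attribute are lost.
plain_text = "".join(element.itertext())

print(outer_html)
print(plain_text)
```

An image carries no text of its own, so a text-only extraction would return nothing for it, while the outer HTML keeps the img tag and its src attribute.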

 

Step 7.

Drag the "Branch Judgement" into the Workflow Designer. ➜ Drag the "Extract Data" into the left branch of "Branch Judgement".

 

Step 8.

Next, we need to locate the XPath of the image, so that we can judge whether the current loop item contains the image.

Click the left branch. ➜ Go to "Execute branch when:". ➜ Select "Current loop item contains elements". ➜ Edit the "Element Xpath" to: //img ➜ Click "Save".

Then, when we click the left branch to make the judgement, the reminder "Element exists. Branch judgment is True." appears.
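The logic of the branch can be sketched in a few lines of Python. This is only an approximation under stated assumptions: the markup is hypothetical, the function name branch_is_true is invented for this example, and the XPath is assumed to be evaluated relative to each loop item (written as ".//img" for ElementTree). How Octoparse evaluates the condition internally is not described in this tutorial.

```python
import xml.etree.ElementTree as ET

# Hypothetical loop items: one wraps an image, two are plain paragraphs.
ITEMS = """
<div>
  <figure><img src="photo.jpg" /></figure>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
"""

def branch_is_true(loop_item):
    """Mimic 'Current loop item contains elements' for the XPath //img."""
    # find() returns None when nothing matches, so compare against None
    # explicitly instead of relying on the element's truth value.
    return loop_item.find(".//img") is not None

container = ET.fromstring(ITEMS)
decisions = []
extracted = []
for item in container:                 # the loop over list items
    matched = branch_is_true(item)     # the branch judgement
    decisions.append(matched)
    if matched:                        # left branch: Extract Data
        extracted.append(ET.tostring(item, encoding="unicode").strip())
    # Otherwise the item falls through and nothing is extracted.

print(decisions)  # prints [True, False, False]
print(extracted)
```

Only the item that contains an img element reaches the extraction step, which matches the "Element exists. Branch judgment is True." reminder for that item.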

 

 

Part 2 - Schedule a task and run it on Octoparse's cloud platform

After you have built the scraping task by following the steps above, you can schedule it to run on Octoparse's cloud platform.

Step 1. Right click the task you've just made in "My Task". ➜ Select "Schedule Cloud Extractions". ➜ Select the option "Schedule Cloud Extraction Settings" to begin the scheduling process.

 

Step 2. Set the parameters.

In the "Schedule Cloud Extraction Settings" dialog box, you can select the Periods of Availability for your task and the Run Mode, which controls how often the task runs to collect data.

· Periods of Availability - The data extraction period, set with a Start date and an End date.

· Run Mode - Once, Weekly, Monthly, Real Time

 

 

After you click "OK" in the Cloud Extraction Scheduled window, the task will be added to the waiting queue and you can check the status of the task.

 

 

Author: The Octoparse Team


 

 
