
Getting started with XPath 1

Wednesday, April 20, 2016 11:02 AM

Before reading this article, we strongly recommend reading these two articles for a better understanding of HTML and XPath.

XPath - Brief Introduction

Brief Intro to HTML Document

Getting Started With XPath 2

Introduction to Octoparse XPath Tool

 

Sometimes users need an XPath expression to locate a specific piece of information on a webpage.

 

Octoparse provides an XPath engine for HTML documents so that we can precisely locate the data on a webpage. Here, we will introduce two useful tools to get hands-on with. They are extensions to the Firefox browser called ‘FireBug’ and ‘FirePath’.

 

As one of the most popular Firefox add-ons, FireBug lets you look up the HTML/CSS of any element on a webpage, which makes it very easy to debug and develop webpages. FirePath is a FireBug extension that adds a development tool to edit, inspect and generate XPath expressions, CSS 3 selectors and jQuery selectors, with auto-completion for XPath. Users who don’t know much about XPath will benefit a lot from FirePath.

 

In this tutorial, we will learn how to install FireBug and FirePath and how to use these two tools to edit and generate XPath expressions.

 

Installation of FireBug and FirePath

 

The installation process takes two steps, as follows.

 

Step 1.

Launch the Mozilla Firefox browser and open the ‘Open menu’ to select the Add-ons section.

 

In the Add-ons Manager page, enter FireBug in the search bar in the top right corner of the browser. Then hit the Install button.

 

Then enter FirePath in the search bar and install it in the same way.

 

Step 2.

Go back to the start page and navigate to an example link. Once the browser has loaded the page, click on the FireBug icon in the top right corner. The FireBug window should appear as below:

 

For this example link, we are going to keep things very simple: we want to find the HTML code on this page that corresponds to a given web element.

 

Here, we first click on the ‘Inspect’ button in FireBug and then hover the cursor over the content of the web page. You will see a blue border appear around each element as you move the cursor. When you click ¥41908, you can see the following appear in FireBug:

 

FireBug and FirePath have found the HTML code for the content ‘¥41908’ on the page. Right-click the corresponding HTML code and select the option ‘Copy XPath’. The XPath expression is “.//*[@id='gdp']”.

 

Use XPath expression in Octoparse

 

You can use this XPath expression in Octoparse as well. Let’s go through the process step by step.

 

Step 1.

Launch Octoparse and build a new task in Advanced Mode. Drag an ‘Open page’ button to the workflow designer.

 

Step 2.

Open “http://192.168.0.4/xpath.html” and click ‘Save’.

 

Step 3.

After the web page is loaded, click the ‘7.2%’ in the web page and select the ‘Extract text’ option.

 

In the Define Fields table, click the field we just extracted, click on the ‘Customize Field’ button and select the second option, ‘Define ways to locate an item’.

 

Step 4.

In the ‘Matching XPath’ bar, paste the XPath expression “.//*[@id='gdp']” we just copied from FireBug. Then click OK. (You can edit the XPath expressions by using our ‘XPath Tools’.) 

 

The extracted content has now changed to ‘¥41908’.

 

We can edit XPath expressions by ourselves! Let’s take a look at the HTML code.

 

The XPath expression for ‘¥41908’ can be written as .//*[@id='gdp']. Why?

 

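Let’s look at the meaning of each part of the expression in sequence:

.  the current (context) node
//  any descendant of that node, at any depth
*  any element, whatever its tag name
[@id='gdp']  a predicate that keeps only the elements whose “id” attribute equals “gdp”

So this XPath expression selects all elements below the current node (here, the whole document) that have an “id” attribute with a value of “gdp”. Easy to understand, right?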
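To make the reading of the expression concrete, here is a minimal Python sketch that evaluates it in two stages with the standard library's xml.etree.ElementTree module, which supports a limited subset of XPath. The page markup is a hypothetical stand-in for the example page: only the two id values, the two text values and the <span id="rate"> tag come from this article, and the use of a <span> for the “gdp” element is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the example page (assumed structure).
PAGE = """<html><body><div>
  <span id="gdp">¥41908</span>
  <span id="rate">7.2%</span>
</div></body></html>"""

root = ET.fromstring(PAGE)

# ".//*": every element below the context node (here the <html> root).
all_descendants = root.findall(".//*")
descendant_tags = [el.tag for el in all_descendants]

# "[@id='gdp']": keep only the elements whose id attribute equals "gdp".
gdp_matches = root.findall(".//*[@id='gdp']")
gdp_texts = [el.text for el in gdp_matches]

print(descendant_tags)
print(gdp_texts)
```

The first list shows what “.//*” matches before the predicate is applied; the second shows that the predicate narrows it down to the single element holding ¥41908.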

 

 

We can also try writing an XPath expression manually. For example, suppose we want to fetch “Growth Rate: 7.2%”, whose HTML code is <span id="rate">7.2%</span>.

 

 

 

So the XPath expression can be written as //span[@id='rate'] or //span[2] (the second span among its siblings). Let’s put these two path expressions into the Matching XPath bar one at a time to see if they really work.

 

Paste the first path expression into the Matching XPath box and click OK.

 

Paste the second path expression into the Matching XPath box and click OK.

 

And we have extracted the same element, “Growth Rate”, with both path expressions. Below is the screenshot of the result.
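The same check can be sketched outside Octoparse. The Python snippet below is only an illustration under assumptions: the page markup is a hypothetical reconstruction (only the id values, the text values and the <span id="rate"> tag come from this article), and because the standard library's xml.etree.ElementTree module requires relative paths on an element, the two expressions are written with a leading “.”.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the example page (assumed structure).
PAGE = """<html><body><div>
  <span id="gdp">¥41908</span>
  <span id="rate">7.2%</span>
</div></body></html>"""

root = ET.fromstring(PAGE)

# Locate the growth rate by its id attribute.
by_id = root.findall(".//span[@id='rate']")

# Locate it by position: spans that are the second <span> child of their parent.
by_position = root.findall(".//span[2]")

# Both expressions should return exactly one element, and the same one.
same_element = (
    len(by_id) == 1
    and len(by_position) == 1
    and by_id[0] is by_position[0]
)

print(by_id[0].text, same_element)
```

The two expressions agree only because of how this page is laid out. The positional form depends on the order of the spans and may select a different element if the page changes, so the id-based form is usually the more robust choice.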

 

We have learned how to edit and create XPath expressions by using some useful tools. We also provide a follow-up document, Getting started with XPath 2, to help you get more out of Octoparse.

 

 

 
