Getting Started With XPath 2

Monday, April 25, 2016 6:15 AM

Before reading this article, you may want to review the basics of HTML and XPath in these documents:

Brief Intro to HTML Document

XPath - Brief Introduction

Getting Started With XPath 1

Introduction to Octoparse XPath Tool

 

 

In our previous tutorial, ‘Getting Started With XPath 1’, you learned how to use FireBug and FirePath to generate and edit XPath. In this tutorial, we will continue learning how to write path expressions.

 

Octoparse provides an XPath engine for HTML documents so that we can precisely locate data on a webpage. We can use the Firefox extension FirePath to find the path expression of an element and paste it into the XPath engine of Octoparse. We can also write the path expression ourselves.

 

Let’s go over how to find the XPath of an element with FirePath.


 

 

Common functions

 

FireBug and FirePath are useful tools for generating XPath. Sometimes a few common functions and operators help us write a more precise XPath by hand.

 

 

Remember that every condition must be placed inside square brackets ([]), and an element name such as div, or the wildcard *, must come before the brackets. XPath expressions are case sensitive.

 

and/or/not

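These operators combine multiple conditions, such as attribute tests, on one element inside a single pair of square brackets. Write and / or between two conditions; write not as a function around a condition, for example not(@class).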
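
As a rough illustration of these operators, the sketch below evaluates three expressions in Python. It assumes the third-party lxml package is installed (the article itself works with FirePath and Octoparse, not Python), and the markup is invented for the example.

```python
from lxml import html  # third-party package, assumed to be installed

# Hypothetical markup, invented for this illustration.
doc = html.fromstring("""
<html><body>
  <a class="nav" href="/music">Music</a>
  <a class="nav" href="/video" target="_blank">Video</a>
  <a class="ad" href="/promo">Promo</a>
  <a href="/about">About</a>
</body></html>
""")

# and: both conditions must hold on the same element
both = doc.xpath("//a[@class='nav' and @target='_blank']/text()")

# or: one matching condition is enough
either = doc.xpath("//a[@class='ad' or @target='_blank']/text()")

# not(): wraps a condition and inverts it (here: links without a class)
no_class = doc.xpath("//a[not(@class)]/text()")

print(both)      # ['Video']
print(either)    # ['Video', 'Promo']
print(no_class)  # ['About']
```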

 

text()

text() selects the text nodes of an element. If you know the exact text, comparing it with = locates the element precisely. The comparison matches the whole text, so partial text will not match; use contains() for that. The format is [text()='...'], with the text entered between single quotes ('').

Syntax example: //a[text()='Music']
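
To see the exact-match behaviour in practice, here is a minimal sketch in Python. It assumes the third-party lxml package is available, and the three links are made up for the example.

```python
from lxml import html  # third-party package, assumed to be installed

# Hypothetical markup, invented for this illustration.
doc = html.fromstring("""
<html><body>
  <a href="/music">Music</a>
  <a href="/music-videos">Music Videos</a>
  <a href="/news">News</a>
</body></html>
""")

# The whole text must equal 'Music', so 'Music Videos' is not selected.
exact = doc.xpath("//a[text()='Music']/@href")

# Part of the text does not satisfy an = comparison.
partial = doc.xpath("//a[text()='Mus']/@href")

print(exact)    # ['/music']
print(partial)  # []
```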

 

contains()

This function operates on a string, either a text node or an attribute value. It performs a substring match: it is true for every node whose string contains the partial string you enter. To find elements by part of their text, use [contains(text(), '...')]; to find elements by part of an attribute value, use [contains(@attribute_name, '...')]. The comma separates the two parameters. Enter the text or string between single quotes ('').

 

Syntax example: //*[contains(text(), 'Converter')]

//*[contains(@class,'dd')]
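
The two expressions above can be tried out with a short Python sketch. It assumes the third-party lxml package is installed, and the markup is hypothetical. Note that contains() matches any substring, so 'dd' is found inside both "odd" and "added".

```python
from lxml import html  # third-party package, assumed to be installed

# Hypothetical markup, invented for this illustration.
doc = html.fromstring("""
<html><body>
  <h2 class="title odd">PDF Converter</h2>
  <p class="desc">Converter for images</p>
  <p class="note">Nothing here</p>
  <span class="added">New</span>
</body></html>
""")

# Elements whose text contains the word 'Converter'
by_text = [el.tag for el in doc.xpath("//*[contains(text(), 'Converter')]")]

# Elements whose class attribute contains the substring 'dd'
by_class = [el.tag for el in doc.xpath("//*[contains(@class,'dd')]")]

print(by_text)   # ['h2', 'p']
print(by_class)  # ['h2', 'span']
```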

 

position() & last()

position() returns the index of an element among the elements selected by the same step, counting from 1. The format is [position()=number], or just the index number in square brackets, such as [1]. You can use comparison operators such as greater-than (>) and less-than (<), combined with and / or / not, to select several child elements at once.

Syntax example: //div[@class='_Ugf']/div[position()=1]

//div[@class='_Ugf']/div[position()>2 and position()<5]
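
A small Python sketch of the two expressions above follows. It assumes the third-party lxml package is installed; the five child div elements are invented so that the indexes are easy to follow.

```python
from lxml import html  # third-party package, assumed to be installed

# Hypothetical markup: one parent div with five child divs.
doc = html.fromstring("""
<html><body>
  <div class="_Ugf">
    <div>one</div><div>two</div><div>three</div>
    <div>four</div><div>five</div>
  </div>
</body></html>
""")

parent = "//div[@class='_Ugf']"

# Indexes start at 1, so this is the first child div.
first = doc.xpath(parent + "/div[position()=1]/text()")

# Shorthand: a bare number in the brackets means the same thing.
first_short = doc.xpath(parent + "/div[1]/text()")

# Positions 3 and 4 (greater than 2 and less than 5).
middle = doc.xpath(parent + "/div[position()>2 and position()<5]/text()")

print(first)        # ['one']
print(first_short)  # ['one']
print(middle)       # ['three', 'four']
```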

 

 

last() returns the index of the last element, so [last()] selects the last element. It is often combined with position() to count from the end.

 

.//*[@id='plist']/ul/li[last()]                           Selects the last li in the ul under any element whose id attribute is plist.

.//*[@id='plist']/ul/li[position()>last()-1]       Selects the last li (same as above).

.//*[@id='plist']/ul/li[position()>last()-5]       Selects the last five li items.

.//*[@id='plist']/ul/li[position()<last()-5]       Selects all li items except the last six. To exclude exactly the last five, use position()<=last()-5.
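
The arithmetic with last() is easy to get wrong by one, so it can help to check it against a list of known length. The sketch below does this in Python for a hypothetical list of ten items, assuming the third-party lxml package is installed. With ten items, last()-5 equals 5, so position()<last()-5 keeps only items 1 to 4.

```python
from lxml import html  # third-party package, assumed to be installed

# Hypothetical markup: a list of ten items, item1 .. item10.
items = "".join(f"<li>item{i}</li>" for i in range(1, 11))
doc = html.fromstring(
    f"<html><body><div id='plist'><ul>{items}</ul></div></body></html>"
)

base = ".//*[@id='plist']/ul/li"

last_one = doc.xpath(base + "[last()]/text()")
same_as_last = doc.xpath(base + "[position()>last()-1]/text()")
last_five = doc.xpath(base + "[position()>last()-5]/text()")
before = doc.xpath(base + "[position()<last()-5]/text()")

print(last_one)      # ['item10']
print(same_as_last)  # ['item10']
print(last_five)     # ['item6', 'item7', 'item8', 'item9', 'item10']
print(before)        # ['item1', 'item2', 'item3', 'item4']
```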

 

 

 

following-sibling:: & preceding-sibling::

The following-sibling axis selects all sibling nodes that come after the current node. Write two colons after the axis name, followed by the element name to select. The format is following-sibling::tag.

 

Syntax example: //div[@class='kno-vrt-t kno-fb-ctx']/following-sibling::div

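Similarly, the preceding-sibling axis selects all sibling elements that come before the current node. In the example below it is used inside a condition: the expression selects every li that has an earlier sibling li whose text is 'Apple Mobiles'.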

Syntax example: //li[preceding-sibling::li='Apple Mobiles']
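
Both axes can be compared side by side in a short Python sketch. It assumes the third-party lxml package is installed, and the list of phone brands is invented around the 'Apple Mobiles' text used above.

```python
from lxml import html  # third-party package, assumed to be installed

# Hypothetical markup, invented for this illustration.
doc = html.fromstring("""
<html><body>
  <ul>
    <li>Samsung Mobiles</li>
    <li>Apple Mobiles</li>
    <li>Nokia Mobiles</li>
    <li>HTC Mobiles</li>
  </ul>
</body></html>
""")

# Siblings after the 'Apple Mobiles' item
after = doc.xpath("//li[text()='Apple Mobiles']/following-sibling::li/text()")

# Siblings before the 'Apple Mobiles' item
before = doc.xpath("//li[text()='Apple Mobiles']/preceding-sibling::li/text()")

# Axis used inside a condition: every li that has an earlier
# sibling li whose text is 'Apple Mobiles'
in_condition = doc.xpath("//li[preceding-sibling::li='Apple Mobiles']/text()")

print(after)         # ['Nokia Mobiles', 'HTC Mobiles']
print(before)        # ['Samsung Mobiles']
print(in_condition)  # ['Nokia Mobiles', 'HTC Mobiles']
```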

 

You can now write XPath yourself to some extent. You can also use the Octoparse XPath Tool to generate a correct XPath that locates the web element and grabs the data you want.

 

Download Octoparse today! Try our free edition and get all the web data you want!

 

 

 
