Getting Started With XPath 2

Monday, April 25, 2016 6:15 AM

Before reading this article, you can review the basics of HTML and XPath in these documents:

Brief Intro to HTML Document

XPath - Brief Introduction

Getting Started With XPath 1

 

In our previous tutorial, ‘Getting Started With XPath 1’, you learned how to use FireBug and FirePath to generate and edit XPath. In this tutorial, we will continue learning how to write path expressions.

 

Octoparse provides an XPath engine for HTML documents so that we can precisely locate data on a webpage. We can use the Firefox extension FirePath to find the path expression of each element and paste it into Octoparse's XPath engine. We can also type in the path expression ourselves.

 

Let’s go over how to find an element’s XPath with FirePath.

<img style="max-width: 100%; min-width: 600px; height: auto; display: block; margin-left: auto; margin-right: auto;" src="/media/1920/gif-firepath.gif" alt="" rel="1985" data-id="1985" />

 

 

Common functions

 

FireBug and FirePath are useful tools for generating XPath. A few commonly used functions can also help us write XPath ourselves.

 

 

 

Remember, expressions that use these functions must be placed inside square brackets ([]), and a tag such as div, or * for any tag, must be specified before the brackets. XPath is case sensitive.
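As a rough illustration of this rule, the sketch below evaluates a few expressions in Python. It assumes the third-party lxml library (not part of Octoparse, and not mentioned in the original tutorial), and the markup is invented for the example.

```python
from lxml import etree  # assumption: third-party library, pip install lxml

# Hypothetical markup, made up for this illustration.
doc = etree.fromstring(
    "<body>"
    "<div class='menu'>Music</div>"
    "<span class='menu'>Video</span>"
    "<div class='Menu'>News</div>"
    "</body>"
)

# The tag (div) comes first, then the expression in square brackets.
divs = doc.xpath("//div[@class='menu']")
print([e.text for e in divs])      # ['Music']

# * stands for any tag, so the span is selected as well.
any_tag = doc.xpath("//*[@class='menu']")
print([e.tag for e in any_tag])    # ['div', 'span']

# Comparisons are case sensitive: 'Menu' is not the same as 'menu'.
upper = doc.xpath("//div[@class='Menu']")
print([e.text for e in upper])     # ['News']
```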

 

text()

This function locates an element by its text when you know the exact text. The comparison is an exact match, so to match on only part of the text, use contains() instead. The format is [text()='']; enter the text between the single quotes ('').

 //a[text()='Music']
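To see how the exact match behaves, here is a minimal sketch in Python. It assumes the third-party lxml library, and the link list is hypothetical markup written only for this example.

```python
from lxml import etree  # assumption: third-party library, pip install lxml

# Hypothetical link list, made up for this illustration.
doc = etree.fromstring(
    "<ul>"
    "<li><a href='/music'>Music</a></li>"
    "<li><a href='/videos'>Music Videos</a></li>"
    "<li><a href='/news'>News</a></li>"
    "</ul>"
)

# Only the link whose text is exactly 'Music' is selected.
exact = doc.xpath("//a[text()='Music']")
exact_hrefs = [a.get("href") for a in exact]
print(exact_hrefs)   # ['/music']

# Part of the text does not match with '='.
partial = doc.xpath("//a[text()='Mus']")
print(partial)       # []
```

The 'Music Videos' link is skipped because its text is longer than the string being compared.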

 

contains()

This function operates on a string, either a text node or an element attribute. It performs a partial match: it selects every element whose text or attribute value contains the string you enter. To find elements by their text, use [contains(text(), '')]; to find elements by an attribute value, use [contains(@attribute_name, '')]. The comma before the single quotes ('') separates the two parameters. Enter the text or string between the single quotes.

 

//*[contains(text(), 'Converter')]

//*[contains(@class,'dd')]
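The two expressions above can be tried out with a small sketch like the following. It assumes the third-party lxml library for Python, and the markup and class names are invented for the example.

```python
from lxml import etree  # assumption: third-party library, pip install lxml

# Hypothetical markup, made up for this illustration.
doc = etree.fromstring(
    "<div>"
    "<p class='add-btn'>PDF Converter</p>"
    "<p class='note'>Image converter</p>"
    "<p class='hidden'>Help</p>"
    "</div>"
)

# Partial match on text; 'converter' in lowercase does not match.
by_text = doc.xpath("//*[contains(text(), 'Converter')]")
print([e.text for e in by_text])           # ['PDF Converter']

# Partial match on an attribute value.
by_class = doc.xpath("//*[contains(@class, 'dd')]")
print([e.get("class") for e in by_class])  # ['add-btn', 'hidden']
```

Note that 'dd' also matches the class 'hidden', so a very short string can select more elements than intended.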

 

position()

This function gives the index of an element among the child elements selected under the same parent, counting from 1. The format is [position()=number], or simply the index number in square brackets. [last()] selects the last element. You can combine the greater-than (>) and less-than (<) signs with and, or, and not() to select several child elements.

//div[@class='_Ugf']/div[position()=1]

//div[@class='_Ugf']/div[position()>2 and position()<5]
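The sketch below shows how these index expressions select elements. It assumes the third-party lxml library for Python; the class name 'list' and the five child elements are hypothetical.

```python
from lxml import etree  # assumption: third-party library, pip install lxml

# Hypothetical parent with five child divs, made up for this illustration.
doc = etree.fromstring(
    "<div class='list'>"
    "<div>one</div><div>two</div><div>three</div>"
    "<div>four</div><div>five</div>"
    "</div>"
)

def texts(expr):
    """Evaluate an XPath expression and return the text of each match."""
    return [e.text for e in doc.xpath(expr)]

base = "//div[@class='list']/div"

first = texts(base + "[position()=1]")                    # ['one']
shorthand = texts(base + "[1]")                           # same as above
middle = texts(base + "[position()>2 and position()<5]")  # ['three', 'four']
last = texts(base + "[last()]")                           # ['five']
not_first = texts(base + "[not(position()=1)]")           # all but the first

print(first, shorthand, middle, last, not_first)
```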

 

 

following-sibling

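following-sibling is an axis rather than a function. It selects all sibling nodes that follow the current node. It is written after a slash as a step of the path, not inside square brackets. Two colons follow the axis name, and then a tag name. The format is following-sibling:: followed by the tag.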

 

//div[@class='kno-vrt-t kno-fb-ctx']/following-sibling::div
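As a minimal sketch of how following-sibling behaves, the Python example below assumes the third-party lxml library. The markup and class names are invented for the illustration.

```python
from lxml import etree  # assumption: third-party library, pip install lxml

# Hypothetical markup, made up for this illustration.
doc = etree.fromstring(
    "<section>"
    "<div class='intro'>Intro</div>"
    "<div class='title'>Title</div>"
    "<p>Note</p>"
    "<div class='body'>Body</div>"
    "<div class='footer'>Footer</div>"
    "</section>"
)

start = "//div[@class='title']"

# All div siblings after the title; the p and the earlier div are skipped.
after_title = [e.text for e in doc.xpath(start + "/following-sibling::div")]
print(after_title)   # ['Body', 'Footer']

# Adding an index keeps only the nearest following div.
next_div = [e.text for e in doc.xpath(start + "/following-sibling::div[1]")]
print(next_div)      # ['Body']

# * selects following siblings of any tag.
all_after = [e.tag for e in doc.xpath(start + "/following-sibling::*")]
print(all_after)     # ['p', 'div', 'div']
```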

 

You can now write XPath yourself to some extent. You can also use the Octoparse XPath tool to help generate correct XPath, locate web elements more precisely, and grab the data you want.

 

Download Octoparse today! Try our free edition and get all the web data you want!

 

 

 
