Getting Started With XPath 2
Monday, April 25, 2016 6:15 AM
Before reading this article, you can pick up basic HTML and XPath knowledge from these documents.
In our previous tutorial ‘Getting Started With XPath 1’, you learned how to use FireBug and FirePath to generate and edit XPath. In this tutorial we will continue learning how to write path expressions.
Octoparse provides an XPath engine for HTML documents so that we can precisely locate the data on a webpage. We can use the Firefox extension FirePath to find out the path expression of each element and paste it into the XPath engine of Octoparse. We can also type in the path expression ourselves.
Let’s go over how to locate XPath with FirePath.
FireBug and FirePath are useful tools for generating XPath. Sometimes we can also use a few common functions to help us write XPath.
Remember, every expression must be placed inside square brackets ([]), and a tag such as * or div must be specified before the brackets. XPath is case sensitive.
The and, or and not operators are used to combine multiple attribute conditions inside one element/tag.
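As a rough sketch of how these operators behave, the Python snippet below runs them against a made-up form. The markup and attribute values are assumptions for illustration only. Python's standard-library ElementTree understands only a subset of XPath and has no and/or/not keywords, so `and` is written here as two chained predicates (equivalent for attribute tests), while `or` and `not` are emulated with plain Python filters.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, invented for this illustration.
SAMPLE = """<form>
  <input type="text" name="q" />
  <input type="text" name="city" />
  <input type="submit" name="go" />
</form>"""

root = ET.fromstring(SAMPLE)
inputs = root.findall(".//input")

# and: //input[@type='text' and @name='q']
# Two chained predicates must both hold, which matches `and` for attribute tests.
both = root.findall(".//input[@type='text'][@name='q']")

# or: //input[@name='q' or @name='go'] (emulated with a Python filter)
either = [i for i in inputs if i.get("name") == "q" or i.get("name") == "go"]

# not: //input[not(@type='text')] (emulated with a Python filter)
negated = [i for i in inputs if i.get("type") != "text"]

and_names = [i.get("name") for i in both]
or_names = [i.get("name") for i in either]
not_names = [i.get("name") for i in negated]
print(and_names, or_names, not_names)
```

In FirePath, or in any full XPath engine, the expressions in the comments can be used directly.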
The text() function locates an element by its exact text. The text must match completely; if several elements share the same text, all of them are returned as a node set. The format is [text()=''], with the text entered between the single quotes ('').
Syntax example: //a[text()='Music']
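To see the exact-match behavior in action, here is a small Python sketch using the standard-library ElementTree. The links are invented for the example. ElementTree does not implement the text() node test, so the sketch uses its closest supported form, [.='Music'], which compares the element's full text content and gives the same result for simple links like these.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, invented for this illustration.
SAMPLE = """<div>
  <a href="/music">Music</a>
  <a href="/movies">Movies</a>
  <a href="/music-videos">Music Videos</a>
</div>"""

root = ET.fromstring(SAMPLE)

# Stand-in for //a[text()='Music'] in ElementTree's XPath subset.
exact_matches = root.findall(".//a[.='Music']")
exact_hrefs = [a.get("href") for a in exact_matches]
print(exact_hrefs)
```

Only the first link is selected: 'Music Videos' contains the word but is not an exact match.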
The contains() function operates on a string, either a text node or an element attribute. It performs a partial match and returns every node whose string contains the text you enter. To find text nodes, use [contains(text(), '')]; to match the value of an element attribute, use [contains(@attribute_name, '')]. The comma before the single quotes separates the two parameters. Enter the text or string between the single quotes ('').
Syntax example: //*[contains(text(), 'Converter')]
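The sketch below illustrates what the two contains() forms select. Python's standard-library ElementTree has no contains() function, so the substring test is emulated with Python's `in` operator. The list markup and the helper name first_text_node are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, invented for this illustration.
SAMPLE = """<ul>
  <li class="tool featured">PDF Converter</li>
  <li class="tool">Unit converter</li>
  <li class="ad">Image Converter Pro</li>
</ul>"""

root = ET.fromstring(SAMPLE)


def first_text_node(elem):
    """Return the first text node directly under elem, or '' if none.

    contains(text(), ...) tests only the first text node, so we mimic that.
    """
    if elem.text:
        return elem.text
    for child in elem:
        if child.tail:
            return child.tail
    return ""


# Emulates //*[contains(text(), 'Converter')]; the match is case sensitive.
text_hits = [e.text for e in root.iter() if "Converter" in first_text_node(e)]

# Emulates //*[contains(@class, 'tool')]
attr_hits = [e.text for e in root.iter() if "tool" in e.get("class", "")]

print(text_hits)
print(attr_hits)
```

'Unit converter' is skipped by the text test because of the lowercase c, which shows why case matters.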
position() & last()
The position() function selects child elements of a parent element by their index, counting from 1. The format is [position()=number], or you can simply put the index number in square brackets. You can use comparison signs such as greater-than and less-than together with the and / or / not operators to select several child elements at once.
Syntax example: //div[@class='_Ugf']/div[position()=1]
//div[@class='_Ugf']/div[position()>2 and position()<5]
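Here is a minimal sketch of both examples, assuming a made-up page with five child divs. ElementTree (Python's standard library) accepts the short index form [1] but not comparisons on position(), so the range is emulated with a list slice.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, invented for this illustration.
SAMPLE = """<body>
  <div class="_Ugf">
    <div>first</div>
    <div>second</div>
    <div>third</div>
    <div>fourth</div>
    <div>fifth</div>
  </div>
</body>"""

root = ET.fromstring(SAMPLE)

# //div[@class='_Ugf']/div[position()=1], written in the short form [1].
first = root.findall(".//div[@class='_Ugf']/div[1]")
first_text = [d.text for d in first]

# //div[@class='_Ugf']/div[position()>2 and position()<5]
# XPath positions start at 1, so positions 3 and 4 are list indexes 2 and 3.
parent = root.find(".//div[@class='_Ugf']")
middle_text = [d.text for d in parent.findall("div")[2:4]]

print(first_text)
print(middle_text)
```

Keep in mind that XPath counts from 1 while Python lists count from 0.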
[last()] selects the last element. It is often combined with position() to select elements counted from the end.
.//*[@id='plist']/ul/li[last()] Selects the last li item of each ul directly under any element whose id attribute is plist
.//*[@id='plist']/ul/li[position()>last()-1] Selects the last li item (same as above)
.//*[@id='plist']/ul/li[position()>last()-5] Selects the last five li items
.//*[@id='plist']/ul/li[position()<last()-4] Selects all li items except the last five
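The following sketch checks the counting on a made-up list of eight items. ElementTree supports [last()] directly; the ranges counted from the end are emulated with Python slices, assuming a single ul under the plist element.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, invented for this illustration.
SAMPLE = """<body>
  <div id="plist">
    <ul>
      <li>item1</li><li>item2</li><li>item3</li><li>item4</li>
      <li>item5</li><li>item6</li><li>item7</li><li>item8</li>
    </ul>
  </div>
</body>"""

root = ET.fromstring(SAMPLE)
plist = root.find(".//*[@id='plist']")
items = plist.findall("ul/li")

# .//*[@id='plist']/ul/li[last()]
last_item = plist.find("ul/li[last()]").text

# li[position()>last()-5]: with 8 items this is position > 3, the last five.
last_five = [li.text for li in items[-5:]]

# li[position()<=last()-5]: with 8 items this is position <= 3,
# everything except the last five.
all_but_last_five = [li.text for li in items[:-5]]

print(last_item, last_five, all_but_last_five)
```

Working through a small list like this is a quick way to avoid off-by-one mistakes when mixing position() and last().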
following-sibling:: & preceding-sibling::
following-sibling is an axis that selects the sibling nodes that come after the current node. Two colons follow the axis name, and the tag to select comes after them. The format is following-sibling::tag.
Syntax example: //div[@class='kno-vrt-t kno-fb-ctx']/following-sibling::div
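To make the selection concrete, the sketch below emulates following-sibling::div in Python. ElementTree does not support sibling axes, so a hypothetical helper, following_siblings, walks the parent's children instead. The markup is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, invented for this illustration.
SAMPLE = """<body>
  <div class="intro">Intro</div>
  <div class="kno-vrt-t kno-fb-ctx">Anchor</div>
  <p>Note</p>
  <div>Detail A</div>
  <div>Detail B</div>
</body>"""

root = ET.fromstring(SAMPLE)


def following_siblings(root, node, tag):
    """Emulate node/following-sibling::tag (node must not be the root)."""
    # ElementTree elements do not know their parent, so build a lookup first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    siblings = list(parent_of[node])
    start = siblings.index(node) + 1
    return [s for s in siblings[start:] if s.tag == tag]


anchor = root.find(".//div[@class='kno-vrt-t kno-fb-ctx']")
following_text = [d.text for d in following_siblings(root, anchor, "div")]
print(following_text)
```

The p element and the div before the anchor are both left out: only later siblings with the div tag are selected.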
Similarly, the preceding-sibling axis selects all sibling elements that come before the current node.
Syntax example: //li[preceding-sibling::li='Apple Mobiles']
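In this example the axis is used inside a predicate, so it selects every li that has an earlier sibling li whose text is exactly 'Apple Mobiles', in other words the items that come after it. The Python sketch below emulates that logic on a made-up list, since ElementTree has no sibling axes; the function name li_with_preceding is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, invented for this illustration.
SAMPLE = """<ul>
  <li>Samsung Mobiles</li>
  <li>Apple Mobiles</li>
  <li>Apple Tablets</li>
  <li>Apple Watches</li>
</ul>"""

root = ET.fromstring(SAMPLE)


def li_with_preceding(parent, label):
    """Emulate li[preceding-sibling::li='label'] among parent's li children."""
    seen = False
    selected = []
    for li in parent.findall("li"):
        if seen:
            # An earlier sibling already matched the label.
            selected.append(li)
        if "".join(li.itertext()) == label:
            seen = True
    return selected


after_apple = [li.text for li in li_with_preceding(root, "Apple Mobiles")]
print(after_apple)
```

The 'Apple Mobiles' item itself is not selected, because it is not its own sibling.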
You can now write XPath yourself to some extent. You can also use the Octoparse XPath Tool to help you generate correct XPath, so that you can better locate the web element and grab the data you want.
Download Octoparse today! Try our free edition and get all the web data you want!