When you can’t find the HTML elements you want on a web page, you can use XPath expressions to locate them in the page’s source code. XPath, the XML Path Language, is a query language for selecting nodes from an XML document. A single path expression is often referred to simply as “an XPath”, and it is used to navigate through the elements and attributes of an XML document.
XML and HTML
Before we dive into XPath, we will briefly introduce what XML and HTML are, and the difference between these two languages.
According to Wikipedia, XML (Extensible Markup Language) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. It’s designed to store and transport data, and it is also used to represent arbitrary data structures.
HTML can be thought of as a looser, non-standard relative of XML. XML focuses more on carrying data, while HTML focuses more on displaying it.
XPath is used to navigate through elements and attributes in an XML document. All web pages are HTML documents by nature. Octoparse provides an XPath engine for HTML documents, so we can use XPath to locate data on a web page precisely.
Here are examples of XPath expressions that Octoparse generates automatically on the Customize Current Action pane:
//UL[@class='nav navbar-nav center-nav']
//*[@id='gdp']
So what do these path expressions mean?
XPath uses path expressions to select nodes. A node is selected by following a path, that is, a sequence of steps. (For more detailed information, please visit https://en.wikipedia.org/wiki/XPath.)
Below, we’ve listed the most useful path expressions, as posted on w3schools.com:
|nodename||Selects all nodes with the name “nodename”|
|/||Selects from the root node|
|//||Selects nodes in the document from the current node that match the selection, no matter where they are|
|.||Selects the current node|
|..||Selects the parent of the current node|
|*||Matches any element node|
|@*||Matches any attribute node|
|node()||Matches any node of any kind|
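The path expressions in the table above can be tried out with a short script. The sketch below uses Python’s standard-library `xml.etree.ElementTree`, which supports only a limited subset of XPath and is not the engine Octoparse uses. The sample document and all variable names are made up for illustration. Because ElementTree does not accept absolute paths on an element, every search starts from the current node (`.`), and attributes are read through `.attrib` instead of `@*`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
XML = """
<bookstore>
  <book category="web">
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
</bookstore>
"""

root = ET.fromstring(XML)  # root is the <bookstore> element

# nodename: child elements of the current node named "book"
books = root.findall("book")

# //: matching elements at any depth below the current node
titles = root.findall(".//title")

# *: any child element of the first book
first_book_children = [child.tag for child in books[0].findall("*")]

# ..: step from each title up to its parent element
parents = root.findall(".//title/..")

# .: the current node itself
current = root.find(".")

# ElementTree has no @* step; attributes are exposed as a dict instead.
first_book_attrs = dict(books[0].attrib)

print([t.text for t in titles])
print(first_book_children)
print([p.tag for p in parents])
print(first_book_attrs)
```

A full XPath 1.0 engine would also accept absolute paths such as `/bookstore/book` and the `node()` test; those are outside the subset that ElementTree implements.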
XPath expressions can also contain predicates, which are used to find a specific node or a node that contains a specific value. Predicates are always embedded in square brackets. The table below, posted on w3schools.com, lists some path expressions with predicates and the corresponding results:
|XPath Expression||Result|
|/bookstore/book[last()]||Selects the last book element that is the child of the bookstore element|
|/bookstore/book[position()<3]||Selects the first two book elements that are children of the bookstore element|
|//title[@lang='en']||Selects all the title elements that have a “lang” attribute with a value of “en”|
|/bookstore/book[price>35.00]/title||Selects all the title elements of the book elements of the bookstore element that have a price element with a value greater than 35.00|
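To make the predicates in the table above concrete, here is a minimal sketch using Python’s standard-library `xml.etree.ElementTree`. The bookstore data is invented for illustration. ElementTree understands `[last()]` and `[@attr='value']`, but it does not implement `position()<3` or numeric comparisons such as `price>35.00`, so those two rows are emulated in plain Python; a full XPath engine would evaluate them directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical bookstore document, used only for illustration.
XML = """<bookstore>
  <book><title lang="en">Everyday Italian</title><price>30.00</price></book>
  <book><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book><title lang="fr">Le Petit Prince</title><price>49.99</price></book>
  <book><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>"""

root = ET.fromstring(XML)  # root is the <bookstore> element

# /bookstore/book[last()]: the last book child of bookstore
last_book = root.findall("book[last()]")

# /bookstore/book[position()<3]: emulated by slicing the first two children
first_two = root.findall("book")[:2]

# //title[@lang='en']: every title whose lang attribute equals "en"
english_titles = root.findall(".//title[@lang='en']")

# /bookstore/book[price>35.00]/title: emulated with a Python filter
expensive_titles = [
    book.find("title")
    for book in root.findall("book")
    if float(book.findtext("price")) > 35.00
]

print([b.findtext("title") for b in last_book])
print([b.findtext("title") for b in first_two])
print([t.text for t in english_titles])
print([t.text for t in expensive_titles])
```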
Now we know that //UL[@class='nav navbar-nav center-nav'] selects all the UL elements that have a “class” attribute with a value of “nav navbar-nav center-nav”, and //*[@id='gdp'] selects all elements in the document that have an “id” attribute with a value of “gdp”.
Sometimes we need to edit the XPath manually with the XPath tools in Octoparse to fetch the data we want from a web page.
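The two expressions can be checked against a small page fragment. The sketch below again uses Python’s standard-library `xml.etree.ElementTree`, not Octoparse’s own engine, and the markup is a made-up, well-formed stand-in for a real page. Two assumptions are worth noting: the tag is written as lowercase `ul` because ElementTree matches tag names case-sensitively, and each path starts with `.` because ElementTree searches relative to the current element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """<html><body>
  <ul class="nav navbar-nav center-nav"><li>Home</li><li>Pricing</li></ul>
  <ul class="footer-links"><li>Contact</li></ul>
  <div id="gdp">sample value</div>
</body></html>"""

page = ET.fromstring(PAGE)

# Every ul whose class attribute is exactly "nav navbar-nav center-nav"
nav_lists = page.findall(".//ul[@class='nav navbar-nav center-nav']")
nav_items = [li.text for ul in nav_lists for li in ul.findall("li")]

# Every element, whatever its tag, whose id attribute is "gdp"
gdp_nodes = page.findall(".//*[@id='gdp']")

print(nav_items)
print([(node.tag, node.text) for node in gdp_nodes])
```

Note that the attribute test compares the whole attribute value, so the second list with the class “footer-links” is not selected.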