XPath - Brief Introduction

4/21/2016 7:38:06 AM

When you can’t select the HTML elements you want on a web page, you can use XPath expressions to locate them in the page’s source code. XPath, the XML Path Language, is a query language for selecting nodes from an XML document. A single expression is often referred to simply as “an XPath”, and it is used to navigate through the elements and attributes of an XML document.



Before we dive into XPath, we will briefly introduce what XML and HTML are, and the difference between these two languages.


According to Wikipedia, XML (Extensible Markup Language) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. It’s designed to store and transport data, as well as to represent arbitrary data structures.


HTML (HyperText Markup Language) is the standard markup language for web pages. Along with CSS and JavaScript, it is used to build web pages and user interfaces for mobile and web applications. Web browsers read HTML files and render them into visible or audible web pages. HTML describes the structure of a page semantically and also carries cues for how the document should be presented.


HTML can be regarded as a non-standard XML format: it uses similar tag-based markup but does not have to follow XML’s strict syntax rules. XML focuses on carrying data, while HTML focuses on displaying data.






XPath is used to navigate through elements and attributes in an XML document. Since all web pages are HTML documents by nature, Octoparse provides an XPath engine for HTML documents so that we can use XPath to locate data on a web page precisely.


Here are examples of XPath expressions that Octoparse generated automatically on the Customize Current Action pane:


//UL[@class='nav navbar-nav center-nav']

//*[@id='gdp']



So what do these path expressions mean?


XPath uses path expressions to select nodes. A node is selected by following a path, which is a sequence of steps. (For more detailed information, please visit http://www.w3schools.com/xsl/xpath_syntax.asp and https://en.wikipedia.org/wiki/XPath.)


Below, we’ve listed the most useful path expressions posted on w3schools.com:


Expression    Description

nodename      Selects all nodes with the name “nodename”

/             Selects from the root node

//            Selects nodes in the document from the current node that match the selection, no matter where they are

.             Selects the current node

..            Selects the parent of the current node

@             Selects attributes

*             Matches any element node

@*            Matches any attribute node

node()        Matches any node of any kind
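To make these expressions concrete, here is a minimal sketch that tries several of them in Python with the standard library's xml.etree.ElementTree module. The sample document is made up for illustration, and ElementTree supports only a subset of XPath, so this shows the idea rather than how Octoparse's own engine behaves. In particular, ElementTree paths are relative to an element, which is why "//" is written as ".//" below.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, made up for this illustration.
XML = """
<bookstore>
  <book category="web">
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
  <book category="children">
    <title lang="fr">Le Petit Prince</title>
    <price>29.99</price>
  </book>
</bookstore>
"""

root = ET.fromstring(XML)  # root is the <bookstore> element

# nodename: child elements of the current node named "book"
books = root.findall("book")

# //: matching nodes at any depth below the current node
# (ElementTree needs the leading "." because paths are relative)
titles = root.findall(".//title")

# . : the current node itself
current = root.find(".")

# .. : the parent of each matched node; here the <book> parents of <title>
parents = root.findall(".//title/..")

# * : any element; here every child element of every <book>
book_children = root.findall("book/*")

# @ : ElementTree uses @ only inside predicates, so attribute values
# are read from the matched elements with .get()
langs = [t.get("lang") for t in titles]

print([b.get("category") for b in books])
print([t.text for t in titles])
print([c.tag for c in book_children])
print(langs)
```

An absolute path starting with "/" is not accepted by ElementTree when searching from an element, so that row of the table is not demonstrated here.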


Predicates are used in XPath expressions to find a specific node or a node that contains a specific value. They are always embedded in square brackets. Below is the table posted on w3schools.com showing some path expressions with predicates and the corresponding results:


XPath Expression                      Result

/bookstore/book[last()]               Selects the last book element that is the child of the bookstore element

/bookstore/book[position()<3]         Selects the first two book elements that are children of the bookstore element

//title[@lang='en']                   Selects all the title elements that have a "lang" attribute with a value of "en"

/bookstore/book[price>35.00]/title    Selects all the title elements of the book elements of the bookstore element that have a price element with a value greater than 35.00
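The four results above can be reproduced with a short Python sketch. It again uses the standard library's xml.etree.ElementTree and a made-up bookstore document (the titles and prices are assumptions for illustration). ElementTree understands [last()] and [@lang='en'] directly, but it does not support position()<3 or numeric comparisons such as price>35.00, so those two are emulated in plain Python here. A full XPath engine would evaluate them as written.

```python
import xml.etree.ElementTree as ET

# Hypothetical bookstore document; titles and prices are made up.
XML = """
<bookstore>
  <book><title lang="en">Everyday Italian</title><price>30.00</price></book>
  <book><title lang="fr">Le Petit Prince</title><price>35.50</price></book>
  <book><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>
"""

root = ET.fromstring(XML)  # root is <bookstore>, so paths start at "book"

# /bookstore/book[last()] : the last <book> child of <bookstore>
last_book = root.find("book[last()]")

# /bookstore/book[position()<3] : emulated with a slice, because
# ElementTree does not support the position() comparison
first_two = root.findall("book")[:2]

# //title[@lang='en'] : titles whose lang attribute equals "en"
english_titles = root.findall(".//title[@lang='en']")

# /bookstore/book[price>35.00]/title : emulated with a Python filter,
# because ElementTree predicates cannot compare numbers
expensive_titles = [
    book.find("title")
    for book in root.findall("book")
    if float(book.findtext("price")) > 35.00
]

print(last_book.findtext("title"))
print([b.findtext("title") for b in first_two])
print([t.text for t in english_titles])
print([t.text for t in expensive_titles])
```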


Now we know that //UL[@class='nav navbar-nav center-nav'] selects all the UL elements that have a “class” attribute with a value of “nav navbar-nav center-nav”, and //*[@id='gdp'] selects all elements in the document that have an “id” attribute with a value of “gdp”. Note that the value must be quoted: without the quotes, gdp would be read as an element name rather than a string.
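As a rough check of this reading, the sketch below runs both expressions against a small page. The page markup, the link texts, and the number inside the "gdp" element are all invented for the example. It uses Python's xml.etree.ElementTree rather than Octoparse's engine, so the page has to be well-formed XML, tag names are matched case-sensitively (hence lowercase "ul"), and "//" is written as ".//".

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment made up for this illustration.
PAGE = """
<html>
  <body>
    <ul class="nav navbar-nav center-nav">
      <li>Home</li>
      <li>Pricing</li>
    </ul>
    <ul class="footer-links">
      <li>Contact</li>
    </ul>
    <div id="gdp">12345</div>
  </body>
</html>
"""

page = ET.fromstring(PAGE)

# //ul[@class='nav navbar-nav center-nav'] : only <ul> elements whose
# class attribute is exactly this string; the footer list is skipped
nav_lists = page.findall(".//ul[@class='nav navbar-nav center-nav']")
nav_items = [li.text for ul in nav_lists for li in ul.findall("li")]

# //*[@id='gdp'] : any element, whatever its tag, with id equal to "gdp"
gdp_nodes = page.findall(".//*[@id='gdp']")

print(nav_items)
print([(node.tag, node.text) for node in gdp_nodes])
```

The class predicate compares the whole attribute string, so a list whose class was only "nav" would not match the first expression.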


Sometimes we need to edit an XPath manually, using the XPath tools in Octoparse, to fetch the right data from a web page.







Author: The Octoparse Team



