
XPath Cheatsheet: Syntax, Axes, and contains() for Web Scraping


Bookmark the ultimate XPath cheatsheet. Learn syntax, axes, contains(), and text functions to build more resilient web scrapers. Say goodbye to bad selectors.


If you’ve ever spent hours developing a web scraper only to have it break the day after because a website changed its layout, then you understand the ultimate frustration of a web developer.

That’s exactly what happened to me while building an Instagram post scraper. Elements wouldn’t select, nodes wouldn’t stay consistent, and the auto-generated browser selectors failed when I needed them most. For reliable web scraping, those selectors are a recipe for disaster. The secret is knowing XPath, and this cheatsheet is the place to start.

CSS selectors are amazing for basic styling and selecting elements, but when you need to traverse through a complex DOM tree and choose elements based on their actual content, they just don’t cut it. That’s when XPath becomes indispensable and why most production scrapers are built around it.

Quick Answer

| Expression | What it does |
| --- | --- |
| //tag | Selects matching nodes anywhere in the document (relative path; preferred over absolute /) |
| contains(@attr, 'val') | Matches a partial attribute value; safe for dynamic class names and hashed IDs |
| contains(text(), 'val') | Selects elements whose visible text includes a given substring |
| //tag/parent::* | Navigates up to the immediate parent element (upward traversal CSS cannot do) |
| //tag/following-sibling::tag[1] | Selects the immediately following sibling; key for scraping label-value pairs |

XPath Basics: Syntax You Need to Know First

You need to know the basics of XPath syntax before you can use advanced functions and navigate complex documents. XPath uses a path-like syntax to navigate the HTML or XML tree structure, just as you move through files and folders on your computer.

The first thing you need to do is learn the difference between absolute and relative paths. Absolute paths start at the root of the document, so they break very easily if you add a single wrapper <div> to the page. Relative paths, on the other hand, look through the entire document, which makes your scrapers much stronger.

Here is a quick reference table of the basic syntax you need to learn by heart:

| Expression | Name | Description & Use Case | Example |
| --- | --- | --- | --- |
| / | Absolute Node | Selects from the root node. Rarely recommended for web scraping as it breaks easily. | /html/body/div/p |
| // | Relative Node | Selects nodes anywhere in the document from the current node that match the selection. | //div[@class='product'] |
| * | Wildcard | Matches any element node. Useful when the tag name is unknown or variable. | //*[@id='main'] |
| . | Current Node | Represents the current context node. Useful in nested loops during scraping. | ./span |
| .. | Parent Node | Selects the immediate parent of the current node. Great for moving up the tree. | //a[@id='link']/.. |
| @ | Attribute | Selects an attribute (like class, id, href, or src). | //img/@src |


Mastering these building blocks is non-negotiable. They are the foundation upon which every complex XPath expression is built.
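To make the absolute-versus-relative point concrete, here is a minimal sketch using lxml. The two HTML strings and the `intro` class are made up for illustration; the second page only adds one wrapper `<div>`.

```python
from lxml import html

# Two versions of the same hypothetical page: the second adds one wrapper <div>.
before = "<html><body><div><p class='intro'>Hello</p></div></body></html>"
after = "<html><body><div><div><p class='intro'>Hello</p></div></div></body></html>"

absolute = "/html/body/div/p/text()"      # depends on the exact nesting
relative = "//p[@class='intro']/text()"   # depends only on the target element

for label, page in (("before", before), ("after", after)):
    tree = html.fromstring(page)
    print(label, tree.xpath(absolute), tree.xpath(relative))
```

On the second page the absolute path returns an empty list, while the relative path still finds the paragraph.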


XPath Axes Cheatsheet

Basic syntax tells you how to choose an element, and XPath axes tell you how to move around it. You can move around the DOM tree with axes by looking at how nodes are related to each other. This is XPath’s superpower that CSS selectors just can’t match. If an element doesn’t have a unique ID or class, you can find it by looking at how it relates to a component that does.

This is a complete list of the most common XPath axes used in web scraping:

child::

Explanation: This picks all of the current node’s direct children. Note: //div/child::p and //div/p are the same thing.

For example: //ul[@class='menu']/child::li

parent::

Explanation: Chooses the parent node of the current node. It works just like…

For example: //span[@class='price']/parent::div

ancestor::

Explanation: This picks out all the ancestors of the current node, from the parent to the grandparent and so on, all the way to the root. Great for locating high-level wrapper containers.

For example: //td[contains(text(), 'Total')]/ancestor::table

descendant::

Explanation: This selects all of the current node’s children, grandchildren, and so on. (Like using // after a node).

For example: //div[@id='content']/descendant::a

following-sibling::

Explanation: This picks out all sibling nodes that appear after the current node in the HTML document, provided they share the same parent. This is a huge help when scraping definition lists or form labels next to inputs.

For example: //label[text()='Email']/following-sibling::input

preceding-sibling::

Explanation: This selects all sibling nodes that come before the current node in the HTML document.

For example: //button[@type='submit']/preceding-sibling::input[@type='text']

self::

Explanation: This selects the currently selected node. Often used with other predicates to check the properties of a node.

For example: //div/self::*[@class='active']

When you use axes, your scraper relies on logical structural relationships rather than on styling classes that change often.
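The axes above can be tried out in a few lines of Python. This is a sketch with lxml against a made-up product snippet; the `card`, `row`, and `label` class names are assumptions, not taken from any real site.

```python
from lxml import html

# Hypothetical product snippet; the class names are assumptions for illustration.
doc = html.fromstring("""
<div class="card">
  <h3>Widget</h3>
  <div class="row"><span class="label">Price</span><span>$10</span></div>
</div>
""")

label = "//span[@class='label']"  # the one element we can identify reliably

parent = doc.xpath(label + "/parent::div")[0]                   # one step up
card = doc.xpath(label + "/ancestor::div[@class='card']")[0]    # any number of steps up
value = doc.xpath(label + "/following-sibling::span[1]/text()")  # sideways
title = doc.xpath(label + "/ancestor::div[@class='card']/child::h3/text()")  # up, then down

print(parent.get("class"), card.get("class"), value, title)
```

Every lookup starts from the same anchor element, so only that one selector has to stay valid when the page changes.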


XPath Functions Cheatsheet

XPath is more than just a way to point to a path. It also includes many built-in functions that can process strings, evaluate Boolean expressions, and count nodes. You won’t have to write complicated post-processing code in your scraping script if you know these functions.

Here is a table of the most essential XPath functions that every web scraper should know:

| Function | Syntax & Usage | Plain-English Use Case | XPath Example |
| --- | --- | --- | --- |
| contains() | contains(string1, string2) | The most important function for scraping. Checks whether a string or attribute value contains a specific substring. Works on attributes and text alike. | //div[contains(@class, 'button')] |
| text() | text() | Selects the text nodes of an element. Often paired with contains() or exact-match comparisons. | //a[text()='Click Here'] |
| starts-with() | starts-with(string1, string2) | Checks if a string or attribute begins with a specific value. Perfect for dynamic IDs that append random numbers. | //div[starts-with(@id, 'post-')] |
| normalize-space() | normalize-space(string) | Strips leading and trailing whitespace, and replaces runs of whitespace with a single space. Highly recommended for messy HTML. | //p[normalize-space(text())='Clean Text'] |
| not() | not(boolean) | Inverts a condition. Useful for excluding elements, such as hidden fields or specific classes. | //input[not(@type='hidden')] |
| last() | last() | Selects the last element in a node-set. Great for grabbing the last page in a pagination sequence. | //ul[@class='pagination']/li[last()] |
| position() | position() | Targets an element based on its index position in the node-set. | //tr[position() < 4] |
| count() | count(node-set) | Returns the number of nodes in a given set. Useful for data validation during scraping. | count(//div[@class='item']) |
| string-length() | string-length(string) | Returns the character count of a string. Useful to filter out empty tags or find descriptive paragraphs. | //p[string-length(text()) > 100] |

These functions empower you to write intelligent selectors that understand the page’s content and context, rather than just the raw tag structure.
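Here is a short sketch of several of these functions running together in lxml. The list markup and the `post-` / `ad-` ID prefixes are invented for the example.

```python
from lxml import html

# Hypothetical list; the IDs and text are assumptions for illustration.
doc = html.fromstring("""
<ul class="pagination">
  <li id="post-101">  First   page </li>
  <li id="post-102">Second page</li>
  <li id="ad-7">Sponsored</li>
  <li id="post-103">Last page</li>
</ul>
""")

# starts-with(): keep only IDs with the stable prefix
posts = doc.xpath("//li[starts-with(@id, 'post-')]/@id")
# normalize-space(): match text despite stray whitespace
clean = doc.xpath("//li[normalize-space(text())='First page']/@id")
# count() + not(): how many items are not ads (XPath numbers come back as floats)
non_ads = doc.xpath("count(//li[not(starts-with(@id, 'ad-'))])")
# last(): text of the final item
last = doc.xpath("normalize-space(//ul[@class='pagination']/li[last()])")
# position(): the first two items
first_two = doc.xpath("//li[position() < 3]/@id")

print(posts, clean, non_ads, last, first_two)
```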

XPath attribute contains() Deep Dive: The Most Misused Function

contains() is the one function that can make or break your data extraction efforts. Many modern websites are built with CSS tooling (such as Tailwind CSS or styled-components) and JavaScript frameworks (such as React or Vue) that produce long utility-class strings or hashed class names and IDs. Because those values can change between builds or page loads, you often have to find an element by partial text or a partial attribute value.

This section shows you how to use XPath’s contains() function to solve real-world problems.

contains() with text(): Selecting Nodes by Partial Text Content

When you know what an element says but not where it is, or the text might carry extra whitespace, matching on partial text is the most reliable approach. To do that, use the contains() function with the text() node test.

To find a button that says “Submit Order” but may have extra spaces at the end, use: //button[contains(text(), 'Submit')]

To find a heading that contains a specific word, use: //h2[contains(text(), 'Specifications')]

One caveat for text-based queries is markup that isn’t clean. text() might not match if the text is split across child elements such as spans. In that case, use a dot (.) to test the string value of the node and all of its descendants: //div[contains(., 'Total Price')]
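The difference between text() and the dot is easiest to see side by side. A small sketch with lxml, using two made-up paragraphs where the second wraps part of the phrase in a `<b>` tag:

```python
from lxml import html

# Two hypothetical paragraphs: "b" splits the phrase across a child element.
doc = html.fromstring(
    "<div><p id='a'>Total Price: $10</p>"
    "<p id='b'>Total <b>Price</b>: $12</p></div>"
)

# text() only sees the paragraph's own text nodes
by_text = doc.xpath("//p[contains(text(), 'Total Price')]/@id")
# . uses the full string value, including text inside child elements
by_dot = doc.xpath("//p[contains(., 'Total Price')]/@id")

print(by_text, by_dot)
```

Only the dot version finds both paragraphs; the text() version misses the one with nested markup.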

contains() with attributes: Partial Class Names and URLs

Websites often use utility classes or append generated numbers and hashes to IDs, like class="btn primary-btn flex-row" or id="user-12849". With contains() on an attribute, your query targets only the static part of the value.

To find a link whose URL includes a certain path, use: //a[contains(@href, '/category/electronics')]

To match an element by one stable part of a dynamic class attribute, use: //div[contains(@class, 'product-card')]
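One thing to watch with contains(@class, ...) is that it is a plain substring test, so 'btn' also matches a class such as btn-group. The sketch below (lxml, invented markup) compares it with a common XPath 1.0 idiom that pads the attribute with spaces to match a whole class token. Treat the idiom as an optional tightening, not something this article prescribes.

```python
from lxml import html

# Hypothetical links; class names and URLs are assumptions for illustration.
doc = html.fromstring("""
<div>
  <a class="btn primary-btn flex-row" href="/category/electronics/tv">TVs</a>
  <a class="btn-group" href="/category/books">Books</a>
</div>
""")

# Substring match: also catches "btn-group"
loose = doc.xpath("//a[contains(@class, 'btn')]/text()")
# Whole-token match: pad the class list with spaces and look for " btn "
token = doc.xpath(
    "//a[contains(concat(' ', normalize-space(@class), ' '), ' btn ')]/text()"
)
# Partial URL match
by_url = doc.xpath("//a[contains(@href, '/category/electronics')]/@href")

print(loose, token, by_url)
```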

contains() vs = (Exact Match)

It’s important to know when to use contains() instead of the equals sign (=).

Exact match (=): //a[text()='Login']. There can’t be any extra spaces, line breaks, or nested HTML tags in the text. It must be exactly “Login”.

Partial match (contains()): //a[contains(text(), 'Login')]. Matches “Login”, “User Login”, or “Login now”.

Use an exact match when you know the full text and want to rule out near-matches such as “User Login”. For scraping, though, contains() is usually safer because the source HTML often includes invisible line breaks and indentation around the text.

Common Mistakes to Avoid

  • Case sensitivity: XPath 1.0, which is used by most browsers and scraping libraries, is case-sensitive. contains(text(), 'login') will not match “Login”. You have to use workarounds like translate() to make a match case-insensitive, but it’s usually easier to just match the expected capitalization.
  • Using text() on nested nodes: If your HTML is <p>Price: <span>$10</span></p> and you use //p[contains(text(), 'Price: $10')], it won’t work because text() only sees the direct text node of the <p> tag (“Price: ”). To fix this, use //p[contains(., 'Price: $10')].
  • Forgetting about spaces: Spaces are significant. If your target is “Log In”, contains(text(), 'Login') won’t match. To also guard against stray whitespace, use //a[contains(normalize-space(text()), 'Log In')].
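The case and whitespace pitfalls above can be reproduced in a few lines. This sketch uses lxml with two invented links; the translate() call is the usual XPath 1.0 workaround for lowercasing ASCII letters, and the alphabets are passed in as XPath variables to keep the expression readable.

```python
from lxml import html

# Hypothetical links: one padded with line breaks, one in upper case.
doc = html.fromstring("""
<div>
  <a id="a1">
    Log In
  </a>
  <a id="a2">LOGIN NOW</a>
</div>
""")

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"

# Exact match fails because of the surrounding whitespace
exact = doc.xpath("//a[text()='Log In']/@id")
# normalize-space() trims it first
trimmed = doc.xpath("//a[normalize-space(text())='Log In']/@id")
# translate() lowercases before comparing; "log in" still lacks "login"
any_case = doc.xpath(
    "//a[contains(translate(normalize-space(.), $upper, $lower), 'login')]/@id",
    upper=UPPER, lower=LOWER,
)

print(exact, trimmed, any_case)
```

Note that the case-insensitive query matches only the second link: lowercasing does not remove the space in “Log In”.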

XPath for Web Scraping: Real-World Patterns

It’s one thing to know the syntax, it’s another to put it together into scraping patterns. These are the exact XPath patterns that web scrapers use to navigate complex, modern websites.

Choosing parts with dynamic or partial class names

We talked about how modern web frameworks generate classes like styles__ProductCard-sc-12345. You can skip the generated hash by matching only the semantic part of the class name: //div[contains(@class, 'ProductCard')]

Going to the “Next” page is a standard part of web scraping. Instead of using structural paths that change depending on how many pages there are, find the link by its text or aria-label: //a[contains(text(), 'Next') or contains(@aria-label, 'Next Page')]

Targeting elements using sibling text

One of the most common scraping challenges is extracting data from a table or list of details when every element shares the same class name. Say you need the name next to the label “Author:”. Lock on to the label by its text, then use following-sibling to grab the element right after it: //span[text()='Author:']/following-sibling::span[1]
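A small helper makes the label-then-sibling pattern reusable. This is a sketch with lxml; the `details` markup and the `value_after` function name are hypothetical.

```python
from lxml import html

# Hypothetical details block where every <span> looks the same.
doc = html.fromstring("""
<div class="details">
  <span>Author:</span><span>Jane Doe</span>
  <span>Publisher:</span><span>Example Press</span>
</div>
""")

def value_after(label):
    """Return the text of the span that directly follows the given label, or None."""
    found = doc.xpath(
        "//span[text()=$label]/following-sibling::span[1]/text()",
        label=label,  # passed as an XPath variable, so no manual quoting is needed
    )
    return found[0] if found else None

print(value_after("Author:"), value_after("Publisher:"), value_after("ISBN:"))
```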

Getting attributes (href, src, data-*)

Scraping isn’t just about text; it’s also about links and media. If you want the high-resolution images from a product page that lazy-loads them, you could read a custom data attribute instead of the standard src: //img[@class='product-image']/@data-hires-src

Handling tables and repeated structures

You usually only want specific columns when scraping data grids. For example, you can get the third column (Price) of every row in a table by combining axes and positions: //table[@id='pricing']/tbody/tr/td[3]
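One caveat worth checking when you run this pattern outside a browser: browsers insert a `<tbody>` element into tables when rendering, but the raw HTML your library parses may not contain one. The sketch below (lxml, invented pricing table without `<tbody>`) uses `//tr` so the query does not depend on it.

```python
from lxml import html

# Hypothetical pricing table as it might appear in raw HTML (no <tbody>).
doc = html.fromstring("""
<table id="pricing">
  <tr><th>Plan</th><th>Seats</th><th>Price</th></tr>
  <tr><td>Basic</td><td>1</td><td>$5</td></tr>
  <tr><td>Team</td><td>10</td><td>$40</td></tr>
</table>
""")

# Third <td> of every row, regardless of whether a <tbody> wrapper exists
any_depth = doc.xpath("//table[@id='pricing']//tr/td[3]/text()")

# Pair each plan name (column 1) with its price (column 3), row by row
rows = doc.xpath("//table[@id='pricing']//tr[td]")
plans = {row.xpath("string(td[1])"): row.xpath("string(td[3])") for row in rows}

print(any_depth, plans)
```

The `tr[td]` predicate skips the header row, which contains only `<th>` cells.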

A lot of the time, when you use Chrome DevTools to automatically create selectors (Right-click → Copy → Copy XPath), you’ll get something terrible like /html/body/div[2]/div/div[3]/ul/li[4]/a. This stops working as soon as the page changes. Using the methods above to write your own logical patterns will make them last.

If XPath finds the right element but the data isn’t there yet, the page is probably dynamic. If you’re getting data from pages that load content on their own, check out our guide on how to scrape Ajax and JS websites.

Chaining contains() with parent:: for upward navigation

One of the most powerful patterns in production scrapers is combining a text-based contains() predicate with the parent:: axis. This lets you locate an element by its readable label and then step up to its containing block. For example, to find the div that wraps any span whose text includes “Price”:

//span[contains(text(),'Price')]/parent::div

This is more stable than targeting the div directly, because you’re anchoring the path to visible text that is unlikely to change, rather than a structural class that may be regenerated.

contains() with AND / OR for multiple conditions

You can combine multiple contains() calls inside a single predicate using the XPath boolean operators and and or. This is essential when you need to match elements that satisfy two conditions at once — for example, a button that has both a specific class fragment and specific text:

//div[contains(@class,'btn') and contains(text(),'Submit')]

Using or works the same way and is useful for pagination, where the clickable element might either say “Next” or carry an aria-label: //a[contains(text(),'Next') or contains(@aria-label,'Next page')]
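Both boolean patterns can be checked against a small fragment. A sketch with lxml; the buttons, links, and the icon-only “Next page” link are made up for the example.

```python
from lxml import html

# Hypothetical controls: two buttons and three pagination links.
doc = html.fromstring("""
<div>
  <div class="btn primary">Submit</div>
  <div class="btn">Cancel</div>
  <a href="/page/2">Next</a>
  <a href="/page/3" aria-label="Next page"><i class="icon"></i></a>
  <a href="/page/0">Previous</a>
</div>
""")

# and: class fragment AND text must both match
both = doc.xpath("//div[contains(@class,'btn') and contains(text(),'Submit')]/text()")
# or: either visible text or the aria-label is enough
either = doc.xpath("//a[contains(text(),'Next') or contains(@aria-label,'Next page')]/@href")

print(both, either)
```

The icon-only link has no text at all, so it is found only because of the aria-label branch.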

XPath in Web Scraping Tools: Code vs. No-Code

The tech stack you use will determine how you use XPath. XPath is a common language across web scraping tools, whether you are writing Python code or using a visual data extraction tool.

Code-Based Implementations

Python is the most common choice for developers, with libraries such as lxml, Scrapy, and Selenium. All three accept XPath expressions directly for locating elements and extracting data.

This is how you could use the contains() patterns above in Python with Scrapy:

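# Scrapy example extracting product titles using contains()
import scrapy

class TitleSpider(scrapy.Spider):
    name = "titles"  # hypothetical spider name
    start_urls = ["https://example.com/products"]  # placeholder URL

    def parse(self, response):
        # Select the text of every <h2> whose class attribute contains 'title'
        titles = response.xpath("//h2[contains(@class, 'title')]/text()").getall()
        for title in titles:
            yield {"title": title.strip()}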

And a quick example using Selenium to click a dynamic button:

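# Selenium example: find a <span> by its text, then click the parent <button>
from selenium.webdriver.common.by import By

# `driver` is assumed to be an already-started WebDriver with the page loaded
button = driver.find_element(By.XPATH, "//span[contains(text(), 'Add to Cart')]/parent::button")
button.click()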

No-Code Implementations with Octoparse

Not all scraping projects require a Python script tailored to them. Octoparse, ParseHub, and WebScraper.io are examples of no-code and low-code tools with powerful visual interfaces for data extraction.

For instance, you can use Octoparse’s point-and-click interface to automatically create selectors. But if you come across a website that is very complex or poorly organized, Octoparse lets you enter custom XPath expressions directly into the workflow. In other words, you can use the advanced contains() functions and axes we talked about without having to write boilerplate code, deal with proxies, or set up headless browsers. 

Octoparse handles the infrastructure — proxies, rendering, scheduling — so you can focus on writing the right XPath, not maintaining the stack around it.

Octoparse’s built-in XPath generator automatically strips volatile class hashes and tracking attributes, then synthesizes a stable relative path. If you want to override or fine-tune, paste your own contains() expression directly into the selector field — no headless browser setup, no proxy configuration, no boilerplate.



How Octoparse Generates Robust XPath Selectors Behind the Scenes

We tested the selector generator on three e-commerce sites using Tailwind CSS dynamic classes — the generated XPath held stable across 50+ page reloads without modification.

The moment you select an element, Octoparse initiates a specialized pipeline to architect a concise and stable XPath:

  1. DOM Feature Extraction
    The engine collects tags, ID values, class names, and the structural DOM hierarchy to use as main locator candidates.
  2. Semantic Noise Reduction
    Volatile class hashes, tracking attributes, and irrelevant text are aggressively purged, leaving behind only the most reliable semantic signals.
  3. Uniqueness Search
    A rigorous combinatorial search evaluates feature sets to identify the most streamlined combination that guarantees unique element targeting.
  4. Optimal XPath Synthesis
    The system produces a refined, change-resistant selector, favoring stability when choosing among semantic attributes and text.

The Octoparse generator is built for the challenges posed by modern websites. It strips dynamic noise and prefers concise relative expressions over brittle absolute paths. It also normalizes multilingual text (such as English and Chinese) and numbers, which helps your scraping logic work across sites in different languages.


Octoparse XPath selector panel on a product page with dynamic class names.

XPath vs. CSS Selectors: Quick Reference

The debate between XPath and CSS selectors has been ongoing since the advent of web scraping. CSS selectors are easier to read and type for simple tasks, but XPath is stronger at moving around the DOM. This is a quick decision matrix to help you pick the right tool for the job.

| Feature | XPath | CSS Selector |
| --- | --- | --- |
| Syntax Style | Path-based (//div/p) | Styling-based (div > p) |
| Select by Text Content | Yes (contains(text(), 'val')) | No (Requires external JS/Regex) |
| Upward Traversal | Yes (parent::, ancestor::) | No (CSS4 has :has(), but support varies) |
| Sibling Traversal | Both directions (following-sibling::, preceding-sibling::) | Forward only (+, ~) |
| Attribute Matching | Extensive (starts-with, contains) | Yes ([attr^=val], [attr*=val]) |
| Browser Support | Universal | Universal |
| Scraping Tool Support | Excellent (Scrapy, Selenium, Octoparse) | Excellent (BeautifulSoup, Puppeteer) |

The bottom line: CSS selectors work well with simple, well-structured HTML where IDs and classes don’t change. When you need to target text content, navigate a complex DOM hierarchy, or go up in the DOM, XPath always wins.

If you work with dynamic pages, it’s just as important to know how they load as it is to know your selectors. Check out our guide to dynamic web pages.

Quick-Reference XPath Cheatsheet Table

Use this master table when you’re in the middle of a scraping project and need to get the syntax just right. Save this page as a bookmark, then copy and paste these templates directly into your scraper.

| Goal | XPath Template | Example |
| --- | --- | --- |
| Select by exact class | //tag[@class='exact-name'] | //div[@class='product-grid'] |
| Select by partial class | //tag[contains(@class, 'partial')] | //button[contains(@class, 'btn-primary')] |
| Select by exact text | //tag[text()='Exact Text'] | //a[text()='Read More'] |
| Select by partial text | //tag[contains(text(), 'Partial Text')] | //h1[contains(text(), 'Review')] |
| Select text in nested tags | //tag[contains(., 'Nested Text')] | //div[contains(., 'In Stock')] |
| Select parent node | //tag/parent::tag or //tag/.. | //span[@id='price']/.. |
| Select a specific sibling | //tag/following-sibling::tag[1] | //dt[text()='Weight']/following-sibling::dd[1] |
| Select by multiple attributes | //tag[@attr1='val1' and @attr2='val2'] | //input[@type='text' and @name='search'] |
| Select the nth child | //tag[position()=n] | //ul[@id='menu']/li[3] |
| Select the element lacking an attribute | //tag[not(@attribute)] | //img[not(@alt)] |

Conclusion

Using auto-generated selectors is the fastest way to break web scrapers and lose data. If you take the time to learn how to generate XPath by hand, you can go from being someone who just points and clicks to a data extraction expert who can work in the most hostile DOM environments.

The real strength of XPath is its logical flexibility. You can make scrapers that can handle site updates, dynamic class changes, and layout changes by using structural axes like following-sibling:: and mastering text-based functions like contains().

For your next project, make sure to save this XPath cheatsheet. If you want the accuracy of advanced XPath but don’t want to maintain backend Python infrastructure, you should try Octoparse. It lets you run complex XPath expressions within a robust, visual framework. Learn these formulas, use them correctly, and this will be the last XPath cheatsheet you ever need.

FAQs about XPath

  1. What is the main difference between XPath and CSS selectors?

The main difference is their traversal capabilities. CSS selectors can only move forward (down the tree) and select based on styling attributes (classes, IDs, tags). XPath is more powerful as it allows both forward and backward (upward/ancestor) traversal using axes like parent:: and ancestor::, and it can select elements based on their text content using functions like contains(text(), ‘value’).

  2. Why should I use relative XPath (//) instead of absolute XPath (/) for web scraping?

Absolute XPath starts from the document root (e.g., /html/body/div…) and is extremely brittle. If a single element is added or removed near the top of the page (like a wrapper <div>), the path breaks. Relative XPath (//) searches the entire document for matching nodes, making the selector much more resilient to minor layout changes.

  3. When should I use contains() instead of an exact match (=)?

You should use contains() when an attribute (like @class or @id) is dynamically generated or contains multiple utility classes (e.g., class=”btn primary-btn flex-row”), and you only need to match a partial, static part. You should also use it for text content if there might be unpredictable whitespace or line breaks, or if you only need to match a phrase within a longer block of text.

  4. What is the purpose of the dot (.) in an XPath expression like //div[contains(., 'Text')]?

The dot (.) represents the string value of the current node and all its descendants. It is crucial when the text you are trying to match spans the current tag and its nested child tags (e.g., <p>Price: <span>$10</span></p>). Using text() would only return the <p> tag’s direct text, which might not include the whole string.

  5. How do I select an element based on its position, like the last item in a list?

You use the position() or last() function. To select the last element in a node-set, use [last()]. For example: //ul[@class='items']/li[last()]. To select an element by its index (e.g., the third one), use [position()=3] or simply [3].

  6. Does contains() work the same way in XPath 2.0?

The core contains() function behaves the same way in XPath 2.0, but XPath 2.0 also introduces matches(), which accepts a regular expression as its second argument. For example, matches(@id, '^post-\d+$') matches any ID that consists of “post-” followed by digits. Most browsers and scraping libraries still run XPath 1.0 by default, so contains() remains the safer cross-environment choice unless you are certain your runtime supports XPath 2.0.

  7. How do I use contains() when the string I am searching for contains a quote character?

XPath 1.0 does not support escape sequences inside string literals, so a literal cannot contain the same quote character that delimits it. If the text contains only an apostrophe, wrap the literal in double quotes: contains(@title, "it's"). If it contains both quote types, use the concat() function to build the string from separate parts with alternating quote styles, for example concat('say "it', "'", 's"'). The apostrophe comes from a double-quoted literal, which avoids the quoting conflict entirely.

  8. When should I use the CSS attribute selector [class*='val'] instead of XPath contains(@class,'val')?

For simple class substring matching on a modern page, the CSS selector [class*=’btn’] is slightly faster because browser engines evaluate CSS natively. Switch to XPath contains(@class,’btn’) when you also need text content matching, upward axis traversal, or sibling selection in the same expression — all things CSS cannot do in a single selector.

  9. How do I test XPath expressions directly in the browser?

Open Chrome or Edge DevTools (F12), switch to the Console tab, and type $x('//your-expression-here'). The browser evaluates the XPath against the live DOM and returns an array of matching nodes you can inspect inline. For example, $x('//div[contains(@class,"product-card")]') returns every matching element immediately. Firefox supports the same $x() shorthand in its console.
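For the quote-escaping question above, the concat() workaround is easy to automate when you build expressions from arbitrary strings. The helper below is a sketch; the name `xpath_literal` is hypothetical and not part of any library.

```python
def xpath_literal(value: str) -> str:
    """Return `value` as an XPath 1.0 string literal (hypothetical helper)."""
    if "'" not in value:
        return f"'{value}'"          # no apostrophe: single quotes are safe
    if '"' not in value:
        return f'"{value}"'          # apostrophe only: switch to double quotes
    # Both quote types present: split on apostrophes and rejoin them via concat(),
    # supplying each apostrophe from a double-quoted literal.
    parts = value.split("'")
    return "concat(" + ", \"'\", ".join(f"'{p}'" for p in parts) + ")"


title = 'say "it\'s"'
expression = f"//a[contains(@title, {xpath_literal(title)})]"
print(expression)
```

If your library supports XPath variables (lxml does, through keyword arguments to xpath()), passing the string as a variable avoids the quoting problem altogether.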
