If you’ve ever spent hours developing a web scraper only to have it break the day after because a website changed its layout, then you understand the ultimate frustration of a web developer.
That’s exactly what happened to me while building an Instagram post scraper. The elements won’t select, the nodes won’t remain consistent, and those auto-generated browser selectors will fail you when you need them the most. The fact of the matter is, when it comes to reliable web scraping, those selectors are a recipe for disaster. The secret is learning your XPath cheatsheet.
CSS selectors are great for basic styling and simple element selection, but when you need to traverse a complex DOM tree or choose elements based on their actual content, they just don’t cut it. That’s when XPath becomes indispensable, and it’s why so many production scrapers are built around it.
Quick Answer
| Expression | What it does |
| //tag | Selects matching nodes anywhere in the document (relative path; preferred over absolute /) |
| contains(@attr, 'val') | Matches a partial attribute value; safe for dynamic class names and hashed IDs |
| contains(text(), 'val') | Selects elements whose visible text includes a given substring |
| //tag/parent::* | Navigates up to the immediate parent element (upward traversal CSS cannot do) |
| //tag/following-sibling::tag[1] | Selects the immediately following sibling — key for scraping label-value pairs |
XPath Basics: Syntax You Need to Know First
You need to know the basics of XPath syntax before you can use advanced functions and navigate complex documents. XPath uses a path-like syntax to navigate the HTML or XML tree structure, just as you move through files and folders on your computer.
The first thing you need to do is learn the difference between absolute and relative paths. Absolute paths start at the root of the document, so they break very easily if you add a single wrapper <div> to the page. Relative paths, on the other hand, look through the entire document, which makes your scrapers much stronger.
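To make the difference concrete, here is a minimal sketch using lxml. The two HTML snippets are invented for illustration: the second is the same page after a redesign adds one wrapper <div>.

```python
from lxml import html

ORIGINAL = "<html><body><div><p class='price'>$10</p></div></body></html>"
# Hypothetical redesign: one extra wrapper <div> around the same content.
REDESIGNED = "<html><body><div><div><p class='price'>$10</p></div></div></body></html>"

ABSOLUTE = "/html/body/div/p"       # depends on every step from the root
RELATIVE = "//p[@class='price']"    # depends only on the target element


def price_texts(page_source, xpath):
    """Parse the page and return the text of every node the XPath selects."""
    tree = html.fromstring(page_source)
    return [node.text for node in tree.xpath(xpath)]


absolute_before = price_texts(ORIGINAL, ABSOLUTE)    # finds the price
absolute_after = price_texts(REDESIGNED, ABSOLUTE)   # finds nothing
relative_after = price_texts(REDESIGNED, RELATIVE)   # still finds the price
```

The absolute path silently returns an empty list after the redesign, which is exactly the kind of failure that is easy to miss in a running scraper.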
Here is a quick reference table of the basic syntax you need to learn by heart:
| Expression | Name | Description & Use Case | Example |
| / | Absolute Node | Selects from the root node. Rarely recommended for web scraping as it breaks easily. | /html/body/div/p |
| // | Relative Node | Selects nodes anywhere in the document from the current node that match the selection. | //div[@class='product'] |
| * | Wildcard | Matches any element node. Useful when the tag name is unknown or variable. | //*[@id='main'] |
| . | Current Node | Represents the current context node. Useful in nested loops during scraping. | ./span |
| .. | Parent Node | Selects the immediate parent of the current node. Great for moving up the tree. | //a[@id='link']/.. |
| @ | Attribute | Selects an attribute (like class, id, href, or src). | //img/@src |
Mastering these building blocks is non-negotiable. They are the foundation upon which every complex XPath expression is built.
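The current-node dot from the table matters most inside loops. The sketch below, using lxml on an invented two-product page, shows why: a query that starts with ./ stays inside the card being processed, while one that starts with // goes back to the whole document.

```python
from lxml import html

# Invented sample markup for illustration.
PAGE = """
<html><body>
<div class="product"><span class="name">Lamp</span><span class="price">$20</span></div>
<div class="product"><span class="name">Desk</span><span class="price">$150</span></div>
</body></html>
"""

tree = html.fromstring(PAGE)
cards = tree.xpath("//div[@class='product']")

rows = []
for card in cards:
    # "./" scopes each query to the current card.
    name = card.xpath("./span[@class='name']/text()")[0]
    price = card.xpath("./span[@class='price']/text()")[0]
    rows.append((name, price))

# Common slip: "//" ignores the card and searches the entire document again.
leaked = cards[0].xpath("//span[@class='name']/text()")
```

Here rows holds one (name, price) pair per card, while leaked contains the names from both cards even though it was called on the first one.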

XPath Axes Cheatsheet
Basic syntax tells you how to choose an element, and XPath axes tell you how to move around it. You can move around the DOM tree with axes by looking at how nodes are related to each other. This is XPath’s superpower that CSS selectors just can’t match. If an element doesn’t have a unique ID or class, you can find it by looking at how it relates to a component that does.
This is a complete list of the most common XPath axes used in web scraping:
child::
Explanation: This picks all of the current node’s direct children. Note: //div/child::p and //div/p are the same thing.
For example: //ul[@class='menu']/child::li
parent::
Explanation: Chooses the parent node of the current node. It works just like…
For example: //span[@class='price']/parent::div
ancestor::
Explanation: This picks out all the ancestors of the current node, from the parent to the grandparent and so on, all the way to the root. Great for locating high-level wrapper containers.
For example: //td[contains(text(), 'Total')]/ancestor::table
descendant::
Explanation: This selects all of the current node’s children, grandchildren, and so on. (Like using // after a node).
For example: //div[@id='content']/descendant::a
following-sibling::
Explanation: This picks out all sibling nodes that appear after the current node in the HTML document, provided they share the same parent. This is a huge help when scraping definition lists or form labels next to inputs.
For example: //label[text()='Email']/following-sibling::input
preceding-sibling::
Explanation: This selects all sibling nodes that come before the current node in the HTML document.
For example: //button[@type='submit']/preceding-sibling::input[@type='text']
self::
Explanation: This selects the current node itself. Often used with a predicate to check the properties of a node.
For example: //div/self::*[@class='active']
When you use axes, your scraper relies on logical structural relationships rather than on styling classes that change often.
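As a quick way to try these axes, the following sketch runs four of the examples above against a small invented page with lxml. The markup and field names are assumptions made for the demo.

```python
from lxml import html

# Invented sample markup for illustration.
PAGE = """
<html><body>
<table id="order">
  <tr><td>Item</td><td>Lamp</td></tr>
  <tr><td>Total</td><td>$20</td></tr>
</table>
<form>
  <label>Email</label><input name="email" type="text">
  <button type="submit">Send</button>
</form>
</body></html>
"""

tree = html.fromstring(PAGE)

# ancestor:: climbs from a cell found by its text to the enclosing table.
table_id = tree.xpath("//td[contains(text(), 'Total')]/ancestor::table/@id")

# following-sibling:: steps from a label to the input next to it.
input_name = tree.xpath("//label[text()='Email']/following-sibling::input/@name")

# preceding-sibling:: looks backwards from the submit button.
before_button = tree.xpath(
    "//button[@type='submit']/preceding-sibling::input[@type='text']/@name"
)

# parent:: moves exactly one level up.
parent_tags = [el.tag for el in tree.xpath("//label[text()='Email']/parent::*")]
```

None of these queries mention a class name, so they keep working as long as the structural relationships stay the same.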

XPath Functions Cheatsheet
XPath is more than just a way to point to a path. It also includes many built-in functions that can process strings, evaluate Boolean expressions, and count nodes. You won’t have to write complicated post-processing code in your scraping script if you know these functions.
Here is a table of the most essential XPath functions that every web scraper should know:
| Function | Syntax & Usage | Plain-English Use Case | XPath Example |
| contains() | contains(string1, string2) | The single most important function for scraping. Checks whether a string or attribute value contains a specific substring. | //div[contains(@class, 'button')] |
| text() | text() | Selects the direct text nodes of an element. Often paired with contains() or an exact match. | //a[text()='Click Here'] |
| starts-with() | starts-with(string1, string2) | Checks if a string or attribute begins with a specific value. Perfect for dynamic IDs that append random numbers. | //div[starts-with(@id, 'post-')] |
| normalize-space() | normalize-space(string) | Strips leading and trailing whitespace and collapses runs of whitespace into a single space. Highly recommended for messy HTML. | //p[normalize-space(text())='Clean Text'] |
| not() | not(boolean) | Inverts a condition. Useful for excluding elements, such as hidden fields or specific classes. | //input[not(@type='hidden')] |
| last() | last() | Selects the last element in a node-set. Great for grabbing the last page in a pagination sequence. | //ul[@class='pagination']/li[last()] |
| position() | position() | Targets an element based on its index position in the node-set. | //tr[position() < 4] |
| count() | count(node-set) | Returns the number of nodes in a given set. Useful for data validation during scraping. | count(//div[@class='item']) |
| string-length() | string-length(string) | Returns the character count of a string. Useful to filter out empty tags or find descriptive paragraphs. | //p[string-length(text()) > 100] |
These functions empower you to write intelligent selectors that understand the page’s content and context, rather than just the raw tag structure.
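A short sketch of a few of these functions in lxml, run on an invented pagination snippet. Note that functions such as count() and normalize-space() return a number or a string rather than a list of nodes.

```python
from lxml import html

# Invented sample markup; the last <li> has deliberately messy whitespace.
PAGE = """
<html><body>
<ul class="pagination"><li>1</li><li>2</li><li>3</li><li>  Last
  page </li></ul>
<input type="hidden" name="token"><input type="text" name="q">
</body></html>
"""

tree = html.fromstring(PAGE)

# count() returns a float in lxml, handy for sanity checks.
item_count = tree.xpath("count(//ul[@class='pagination']/li)")

# position() filters by index within the node-set (1-based).
first_two = tree.xpath("//ul[@class='pagination']/li[position() < 3]/text()")

# last() picks the final item; normalize-space() cleans its text.
last_clean = tree.xpath("normalize-space(//ul[@class='pagination']/li[last()])")

# not() excludes the hidden field.
visible_inputs = tree.xpath("//input[not(@type='hidden')]/@name")
```

Doing the cleanup inside the expression means the Python side receives values that are already usable.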
XPath attribute contains() Deep Dive: The Most Misused Function
contains() is the one function that can make or break your data extraction efforts. Many modern websites use CSS tooling (such as Tailwind CSS or styled-components) and JavaScript frameworks (such as React or Vue) that produce long or auto-generated class names. Because those classes and IDs can change between builds or even page loads, you often have to find an element by partial text or a partial attribute value instead.
This section shows you how to use XPath’s “contains” operator to solve real-world problems.
contains() with text(): Selecting Nodes by Partial Text Content
When you know what an element says but not where it sits, or the text may carry extra whitespace, matching on partial text is the most reliable approach. Combine the contains() function with the text() node test.
To find a button that says “Submit Order” but may have extra spaces at the end, use: //button[contains(text(), 'Submit')]
To find a heading whose text contains a given word, use: //h2[contains(text(), 'Specifications')]
One helpful trick covers markup that isn’t clean. text() might not match if the text sits inside child elements such as spans. If that’s the case, use a dot (.) to check the string value of the node and all of its descendants: //div[contains(., 'Total Price')]
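The difference between text() and the dot is easy to verify. This sketch uses lxml on an invented <div> whose label is split across a child <span> and a loose text node.

```python
from lxml import html

# Invented sample markup: "Total" sits in a child span, " Price: " is a direct text node.
PAGE = (
    "<html><body>"
    "<div class='summary'><span>Total</span> Price: <b>$30</b></div>"
    "</body></html>"
)

tree = html.fromstring(PAGE)

# text() only sees the div's own text node (" Price: "), so this finds nothing.
via_text = tree.xpath("//div[contains(text(), 'Total Price')]")

# "." uses the full string value ("Total Price: $30"), so this matches.
via_dot = tree.xpath("//div[contains(., 'Total Price')]")
```

One caveat worth knowing: because the dot includes all descendant text, contains(., ...) also matches every ancestor that wraps the target, so it usually pays to keep the tag name or another predicate specific.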
contains() with attributes: Partial Class Names and URLs
Websites often use utility classes or add random hashes to IDs, like class="btn primary-btn flex-row" or id="user-12849". Point contains() at the static part of the attribute value and ignore the rest.
To find a link whose URL includes a certain path, use: //a[contains(@href, '/category/electronics')]
If you need an XPath that tolerates dynamic class names, use: //div[contains(@class, 'product-card')]
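One thing to keep in mind is that contains(@class, ...) is a plain substring test, so a short fragment can match more than intended. The sketch below shows this with lxml on invented markup, alongside a common XPath 1.0 idiom (not from this article) that matches a whole class token by padding the class list with spaces.

```python
from lxml import html

# Invented sample markup for illustration.
PAGE = """
<html><body>
<button class="btn primary-btn flex-row">Buy</button>
<div class="btn-group">Toolbar</div>
</body></html>
"""

tree = html.fromstring(PAGE)

# Substring match: "btn" also appears inside "btn-group".
loose = [el.tag for el in tree.xpath("//*[contains(@class, 'btn')]")]

# Token match: wrap the class list in spaces, then look for " btn ".
strict = [
    el.tag
    for el in tree.xpath(
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' btn ')]"
    )
]
```

For hashed class names such as the ones produced by styled-components, the plain substring form is still the right tool, because the stable part is only a fragment of a single token.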
contains() vs = (Exact Match)
It’s important to know when to use contains() instead of the equals sign (=).
Exact match (=): //a[text()='Login']. There can’t be any extra spaces, line breaks, or nested HTML tags in the text. It must be exactly “Login”.
Partial match (contains()): //a[contains(text(), 'Login')]. Matches “Login”, “User Login”, or “Login now”.
Use an exact match when you need to rule out near-duplicates such as “User Login”. For scraping in general, contains() is much safer because web developers often leave invisible whitespace and line breaks in the DOM.
Common Mistakes to Avoid
- Case Sensitivity: XPath 1.0, which is used by most browsers and scraping libraries, is case-sensitive. contains(text(), 'login') will not match “Login”. You can work around this with translate() to fold case, but it’s usually easier to just match the expected capitalization.
- Nested Nodes: If your HTML is <p>Price: <span>$10</span></p>, matching on text() will fail. //p[contains(text(), 'Price: $10')] won’t work because text() only sees the direct text node of the <p> tag (“Price: ”). To fix this, use //p[contains(., 'Price: $10')].
- Whitespace: Spaces matter. If your target is “Log In”, contains(text(), 'Login') won’t match. For the most safety, normalize the text first: //a[contains(normalize-space(text()), 'Log In')].
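The translate() workaround mentioned in the first point can look like the sketch below. It uses lxml on an invented link whose text has odd casing and spacing. The UPPER and LOWER constants are just helper names for this example.

```python
from lxml import html

# Invented sample markup: messy casing and whitespace in the first link.
PAGE = (
    "<html><body>"
    "<a href='/in'>  LOG   IN </a><a href='/up'>Sign up</a>"
    "</body></html>"
)

tree = html.fromstring(PAGE)

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"

# Case-sensitive and whitespace-sensitive: no match.
case_sensitive = tree.xpath("//a[contains(text(), 'Log In')]/@href")

# normalize-space() collapses the whitespace, translate() lowercases A-Z,
# and the search term is written in lowercase to match.
insensitive_xpath = (
    f"//a[contains(translate(normalize-space(.), '{UPPER}', '{LOWER}'), 'log in')]/@href"
)
case_insensitive = tree.xpath(insensitive_xpath)
```

translate() maps characters one by one, so this version only folds the 26 ASCII letters. Accented characters would need to be added to both alphabets.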
XPath for Web Scraping: Real-World Patterns
It’s one thing to know the syntax, it’s another to put it together into scraping patterns. These are the exact XPath patterns that web scrapers use to navigate complex, modern websites.
Choosing parts with dynamic or partial class names
We talked about how modern web frameworks make classes like styles__ProductCard-sc-12345. You can skip the gibberish hash by going straight to the semantic part of the class name: //div[contains(@class, 'ProductCard')]
Choosing a link for pagination
Going to the “Next” page is a standard part of web scraping. Instead of using structural paths that change depending on how many pages there are, find the button by its text or aria-labels: //a[contains(text(), 'Next') or contains(@aria-label, 'Next Page')]
Targeting elements using sibling text
One of the most common scraping challenges is extracting data from a table or list of details when every element shares the same class name. Say you need the name next to the label “Author:”. Lock onto the label by its text, then use the following-sibling axis to grab the value right after it: //span[text()='Author:']/following-sibling::span[1]
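Wrapped in a small helper, the label-to-value pattern becomes reusable. This is a sketch with lxml on invented markup. value_for is a hypothetical helper name, and the label is passed as an XPath variable ($label), which lxml supports through keyword arguments.

```python
from lxml import html

# Invented sample markup: every span looks the same, only the text differs.
PAGE = """
<html><body>
<div class="meta"><span>Author:</span> <span>Jane Doe</span></div>
<div class="meta"><span>Publisher:</span> <span>Acme Press</span></div>
</body></html>
"""

tree = html.fromstring(PAGE)


def value_for(label):
    """Return the text of the span that directly follows the given label, or None."""
    hits = tree.xpath(
        "//span[normalize-space(text())=$label]/following-sibling::span[1]/text()",
        label=label,  # bound to $label inside the expression
    )
    return hits[0] if hits else None


author = value_for("Author:")
publisher = value_for("Publisher:")
missing = value_for("ISBN:")
```

Returning None for a missing label keeps the scraper from crashing on pages where a field is simply absent.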
Getting attributes (href, src, data-*)
Scraping isn’t just about text; it’s also about links and media. If you want to get all the high-resolution images from a product page that loads its images lazily, you could read a custom data attribute instead of the standard src: //img[@class='product-image']/@data-hires-src.
Handling tables and repeated structures
You usually only want specific columns when scraping data grids. For example, you can get the third column (Price) of every row in a table by combining axes and positions: //table[@id='pricing']/tbody/tr/td[3].
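Here is a minimal lxml sketch of the table pattern on an invented pricing table. It pulls one column directly, then builds one record per row.

```python
from lxml import html

# Invented sample markup for illustration.
PAGE = """
<html><body>
<table id="pricing">
  <tbody>
    <tr><td>Basic</td><td>1 user</td><td>$5</td></tr>
    <tr><td>Pro</td><td>10 users</td><td>$20</td></tr>
  </tbody>
</table>
</body></html>
"""

tree = html.fromstring(PAGE)

# Third cell of every row, in document order.
prices = tree.xpath("//table[@id='pricing']/tbody/tr/td[3]/text()")

# One dict per row; "//tr" also works when the raw HTML has no <tbody>.
plans = [
    {"plan": row.xpath("./td[1]/text()")[0], "price": row.xpath("./td[3]/text()")[0]}
    for row in tree.xpath("//table[@id='pricing']//tr")
]
```

Browsers insert a <tbody> into the DOM even when the page source omits it, so an expression copied from DevTools may need //tr instead of /tbody/tr when it runs against raw HTML.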
A lot of the time, when you use Chrome DevTools to automatically create selectors (Right-click → Copy → Copy XPath), you’ll get something terrible like /html/body/div[2]/div/div[3]/ul/li[4]/a. This stops working as soon as the page changes. Using the methods above to write your own logical patterns will make them last.
If XPath finds the right element but the data isn’t there yet, the page is probably dynamic. If you’re getting data from pages that load content on their own, check out our guide on how to scrape Ajax and JS websites.
Chaining contains() with parent:: for upward navigation
One of the most powerful patterns in production scrapers is combining a text-based contains() predicate with the parent:: axis. This lets you locate an element by its readable label and then step up to its containing block. For example, to find the div that wraps any span whose text includes “Price”:
//span[contains(text(),'Price')]/parent::div
This is more stable than targeting the div directly, because you’re anchoring the path to visible text that is unlikely to change, rather than a structural class that may be regenerated.
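A minimal sketch of this pattern with lxml follows. The hashed class names are invented to stand in for generated ones. The query never mentions them.

```python
from lxml import html

# Invented sample markup with throwaway class names.
PAGE = """
<html><body>
<div class="css-1x2y3z"><span>Price</span><span>$49</span></div>
<div class="css-9a8b7c"><span>Weight</span><span>2 kg</span></div>
</body></html>
"""

tree = html.fromstring(PAGE)

# Anchor on the visible label, then step up to the wrapping div.
block = tree.xpath("//span[contains(text(),'Price')]/parent::div")[0]

# From the wrapper, read the value that sits next to the label.
value = block.xpath("./span[2]/text()")
wrapper_class = block.get("class")
```

Once the wrapper is in hand, every further query can be scoped to it with ./ and the rest of the page no longer matters.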
contains() with AND / OR for multiple conditions
You can combine multiple contains() calls inside a single predicate using the XPath boolean operators and and or. This is essential when you need to match elements that satisfy two conditions at once — for example, a button that has both a specific class fragment and specific text:
//div[contains(@class,'btn') and contains(text(),'Submit')]
Using or works the same way and is useful for pagination, where the clickable element might say either “Next” or carry an aria-label: //a[contains(text(),'Next') or contains(@aria-label,'Next page')]
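Both operators can be checked quickly with lxml. In the invented markup below, two buttons share the same classes, and the “next” link has no text at all, only an icon and an aria-label.

```python
from lxml import html

# Invented sample markup for illustration.
PAGE = """
<html><body>
<div class="btn large">Submit order</div>
<div class="btn large">Cancel</div>
<a href="/page/2" aria-label="Next page"><i class="icon"></i></a>
<a href="/page/0">Previous</a>
</body></html>
"""

tree = html.fromstring(PAGE)

# and: both conditions must hold, so "Cancel" is filtered out.
submit = tree.xpath(
    "//div[contains(@class,'btn') and contains(text(),'Submit')]/text()"
)

# or: either condition is enough, so the icon-only link still matches.
next_href = tree.xpath(
    "//a[contains(text(),'Next') or contains(@aria-label,'Next page')]/@href"
)
```

The or form is a cheap way to make one selector survive across page templates that label the same control differently.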
XPath in Web Scraping Tools: Code vs. No-Code
The tech stack you use will determine how you use XPath. XPath is the language that everyone uses for web scraping, whether they are writing Python code or using a visual data extraction tool.
Code-Based Implementations
Python is the industry standard for developers, and its main scraping libraries (lxml, Scrapy, and Selenium) all support XPath as a core way to locate elements and extract data.
This is how you could use our XPath contains text patterns in Python with Scrapy:
And a quick example using Selenium to click a dynamic button:
No-Code Implementations with Octoparse
Not all scraping projects require a Python script tailored to them. Octoparse, ParseHub, and WebScraper.io are examples of no-code and low-code tools with powerful visual interfaces for data extraction.
For instance, you can use Octoparse’s point-and-click interface to automatically create selectors. But if you come across a website that is very complex or poorly organized, Octoparse lets you enter custom XPath expressions directly into the workflow. In other words, you can use the advanced contains() functions and axes we talked about without having to write boilerplate code, deal with proxies, or set up headless browsers.
Octoparse handles the infrastructure — proxies, rendering, scheduling — so you can focus on writing the right XPath, not maintaining the stack around it.
Octoparse’s built-in XPath generator automatically strips volatile class hashes and tracking attributes, then synthesizes a stable relative path. If you want to override or fine-tune, paste your own contains() expression directly into the selector field — no headless browser setup, no proxy configuration, no boilerplate.

How Octoparse Generates Robust XPath Selectors Behind the Scenes
We tested the selector generator on three e-commerce sites using Tailwind CSS dynamic classes — the generated XPath held stable across 50+ page reloads without modification.
The moment you select an element, Octoparse initiates a specialized pipeline to architect a concise and stable XPath:
- DOM Feature Extraction
The engine collects tags, ID values, class names, and the structural DOM hierarchy to use as main locator candidates.
- Semantic Noise Reduction
Volatile class hashes, tracking attributes, and irrelevant text are aggressively purged, leaving behind only the most reliable semantic signals.
- Uniqueness Search
A rigorous combinatorial search evaluates feature sets to identify the most streamlined combination that guarantees unique element targeting.
- Optimal XPath Synthesis
The system makes a refined, change-resistant selector because stability is more important than semantic attributes and text.
The Octoparse generator is built for the challenges of modern websites. It strips out dynamic noise and prefers concise relative expressions over brittle absolute paths. It also normalizes numbers and multilingual text, such as English and Chinese, so that your scraping logic carries across sites in different languages.

Octoparse XPath selector panel on a product page with dynamic class names.
XPath vs. CSS Selectors: Quick Reference
The debate between XPath and CSS selectors is as old as web scraping itself. CSS selectors are easier to read and type for simple tasks, but XPath is the stronger choice for traversing the DOM. This is a quick decision matrix to help you pick the right tool for the job.
| Feature | XPath | CSS Selector |
| Syntax Style | Path-based (//div/p) | Styling-based (div > p) |
| Select by Text Content | Yes (contains(text(), 'val')) | No (Requires external JS/Regex) |
| Upward Traversal | Yes (parent::, ancestor::) | No (CSS4 has :has(), but support varies) |
| Sibling Traversal | Both directions (following-sibling::, preceding-sibling::) | Forward only (+, ~) |
| Attribute Matching | Extensive (starts-with, contains) | Yes ([attr^=val], [attr*=val]) |
| Browser Support | Universal | Universal |
| Scraping Tool Support | Excellent (Scrapy, Selenium, Octoparse) | Excellent (BeautifulSoup, Puppeteer) |
The bottom line: CSS selectors work well with simple, well-structured HTML where IDs and classes don’t change. When you need to target text content, navigate a complex DOM hierarchy, or go up in the DOM, XPath always wins.
If you work with dynamic pages, it’s just as important to know how they load as it is to know your selectors. Check out our guide to dynamic web pages.
Quick-Reference XPath Cheatsheet Table
Use this master table when you’re in the middle of a scraping project and need to get the syntax just right. Save this page as a bookmark, then copy and paste these templates directly into your scraper.
| Goal | XPath Template | Example |
| Select by exact class | //tag[@class='exact-name'] | //div[@class='product-grid'] |
| Select by partial class | //tag[contains(@class, 'partial')] | //button[contains(@class, 'btn-primary')] |
| Select by exact text | //tag[text()='Exact Text'] | //a[text()='Read More'] |
| Select by partial text | //tag[contains(text(), 'Partial Text')] | //h1[contains(text(), 'Review')] |
| Select text in nested tags | //tag[contains(., 'Nested Text')] | //div[contains(., 'In Stock')] |
| Select parent node | //tag/parent::tag or //tag/.. | //span[@id='price']/.. |
| Select a specific sibling | //tag/following-sibling::tag[1] | //dt[text()='Weight']/following-sibling::dd[1] |
| Select by multiple attributes | //tag[@attr1='val1' and @attr2='val2'] | //input[@type='text' and @name='search'] |
| Select the nth child | //tag[position()=n] | //ul[@id='menu']/li[3] |
| Select the element lacking an attribute | //tag[not(@attribute)] | //img[not(@alt)] |
Conclusion
Using auto-generated selectors is the fastest way to break web scrapers and lose data. If you take the time to learn how to generate XPath by hand, you can go from being someone who just points and clicks to a data extraction expert who can work in the most hostile DOM environments.
The real strength of XPath is its logical flexibility. You can make scrapers that can handle site updates, dynamic class changes, and layout changes by using structural axes like following-sibling:: and mastering text-based functions like contains().
For your next project, make sure to save this XPath cheatsheet. If you want the accuracy of advanced XPath but don’t want to keep up with backend Python infrastructure, you should try Octoparse. It provides the best environment for running complex XPath expressions within a robust, visual framework. Learn these formulas, use them correctly, and this will be the last XPath cheatsheet you ever need.
FAQs about XPath
- What is the main difference between XPath and CSS selectors?
The main difference is their traversal capabilities. CSS selectors can only move forward (down the tree) and select based on styling attributes (classes, IDs, tags). XPath is more powerful as it allows both forward and backward (upward/ancestor) traversal using axes like parent:: and ancestor::, and it can select elements based on their text content using functions like contains(text(), 'value').
- Why should I use relative XPath (//) instead of absolute XPath (/) for web scraping?
Absolute XPath starts from the document root (e.g., /html/body/div…) and is extremely brittle. If a single element is added or removed near the top of the page (like a wrapper <div>), the path breaks. Relative XPath (//) searches the entire document for matching nodes, making the selector much more resilient to minor layout changes.
- When should I use contains() instead of an exact match (=)?
You should use contains() when an attribute (like @class or @id) is dynamically generated or contains multiple utility classes (e.g., class="btn primary-btn flex-row"), and you only need to match a partial, static part. You should also use it for text content if there might be unpredictable whitespace or line breaks, or if you only need to match a phrase within a longer block of text.
- What is the purpose of the dot (.) in an XPath expression like //div[contains(., 'Text')]?
The dot (.) represents the string value of the current node and all its descendants. It is crucial when the text you are trying to match spans the current tag and its nested child tags (e.g., <p>Price: <span>$10</span></p>). Using text() would only return the <p> tag’s direct text, which might not include the whole string.
- How do I select an element based on its position, like the last item in a list?
You use the position() or last() function. To select the last element in a node-set, use [last()]. For example: //ul[@class='items']/li[last()]. To select an element by its index (e.g., the third one), use [position()=3] or simply [3].
- Does contains() work the same way in XPath 2.0?
The core contains() function is identical in XPath 2.0, but XPath 2.0 also introduces matches(), which accepts a full regular expression as its second argument. For example, matches(@id, '^post-\d+$') matches any ID made up of “post-” followed by one or more digits. Most browsers and scraping libraries still run XPath 1.0 by default, so contains() remains the safer cross-environment choice unless you are certain your runtime supports XPath 2.0.
- How do I use contains() when the string I am searching for contains a quote character?
XPath 1.0 does not support escape sequences inside string literals, so a literal cannot contain the same quote character that delimits it. If the value contains only an apostrophe, wrap it in double quotes: contains(@title, "it's"). If it contains both apostrophes and double quotes, or your host language makes mixed quoting awkward, use the concat() function to build the string from separately quoted parts: contains(@title, concat('it', "'", 's')). This supplies the apostrophe from a double-quoted literal, avoiding the quoting conflict entirely.
- When should I use the CSS attribute selector [class*='val'] instead of XPath contains(@class,'val')?
For simple class substring matching on a modern page, the CSS selector [class*='btn'] is usually slightly faster because browser engines evaluate CSS natively. Switch to XPath contains(@class,'btn') when you also need text content matching, upward axis traversal, or sibling selection in the same expression, all of which CSS cannot do in a single selector.
- How do I test XPath expressions directly in the browser?
Open Chrome or Edge DevTools (F12), switch to the Console tab, and type $x('//your-expression-here'). The browser evaluates the XPath against the live DOM and returns an array of matching nodes you can inspect inline. For example, $x('//div[contains(@class,"product-card")]') returns every matching element immediately. Firefox supports the same $x() shorthand in its console.
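For the quote-character question above, the concat() workaround can be automated. The sketch below is one possible helper. xpath_literal is a hypothetical name, and the sample markup is invented. It picks the simplest quoting that works and falls back to concat() only when the value mixes both quote types.

```python
from lxml import html


def xpath_literal(value):
    """Return value as an XPath 1.0 string literal, using concat() only if needed."""
    if "'" not in value:
        return f"'{value}'"
    if '"' not in value:
        return f'"{value}"'
    # Both quote types present: split on apostrophes and re-join the pieces
    # with a double-quoted apostrophe between them.
    parts = value.split("'")
    return "concat(" + ", \"'\", ".join(f"'{part}'" for part in parts) + ")"


# Invented sample markup: the title contains an apostrophe and double quotes.
PAGE = '<html><body><a title="it\'s &quot;new&quot;">x</a></body></html>'
tree = html.fromstring(PAGE)

target = 'it\'s "new"'
literal = xpath_literal(target)
matches = tree.xpath(f"//a[contains(@title, {literal})]/text()")
```

When the library supports XPath variables, as lxml does with keyword arguments, passing the value as a variable avoids the quoting problem altogether.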




