Step-by-step tutorials for you to get started with web scraping
Locate elements with XPath
Wednesday, November 24, 2021
The latest version of this tutorial is available here. Go and check it out!
What is XPath? How does it work in Octoparse?
XPath is a language for locating specific elements on a page. In Octoparse, an XPath that you modify by hand is often more flexible and accurate than the one auto-generated when you click elements during task configuration.
Octoparse lets you modify XPath so that you can precisely locate the data you want to scrape. If you would like to learn more about XPath, here is a tutorial for reference: https://www.w3schools.com/xml/xpath_intro.asp
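As a minimal sketch of what "locating elements" means, the Python snippet below runs one XPath expression against a small page fragment. The fragment, the class names and the variable names are made up for illustration, and Python's standard-library xml.etree.ElementTree only understands a subset of XPath, so this shows the idea rather than how Octoparse itself evaluates XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment, written as well-formed XML so that the
# standard-library parser can read it.
PAGE = """
<html>
  <body>
    <div class="article">
      <h2>First post</h2>
      <div class="publish-time">2021-11-24</div>
    </div>
    <div class="article">
      <h2>Second post</h2>
      <div class="publish-time">2021-11-25</div>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Select every <div> whose class attribute is exactly "publish-time",
# at any depth below the root element.
publish_times = [el.text for el in root.findall(".//div[@class='publish-time']")]
print(publish_times)
```

The same expression, .//div[@class='publish-time'], is the kind of value you would enter in an XPath textbox in Octoparse.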
When should I use XPath?
In most cases, you don't need to write XPath yourself. In some situations, however, you may need to modify it to locate the data on the webpage more reliably.
(This is an advanced tutorial. Before using XPath, we suggest you get familiar with Octoparse first.)
- Data in irregular locations
- Extra data or missing data
- Pagination without a "Next" button
- "Next" button cannot be located precisely
- Drop-down menu without switch loop
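To illustrate the case where the "Next" button cannot be located precisely, here is a small sketch that compares a position-based XPath with a text-based one. The two pager fragments and the helper link_text are hypothetical, and the code uses Python's xml.etree.ElementTree, which spells the text test as [.='Next'] (Python 3.7+) instead of [text()='Next'].

```python
import xml.etree.ElementTree as ET

# Two hypothetical pager fragments from the same site (well-formed XML).
FIRST_PAGE = """
<div class="pager">
  <a href="/page/2">2</a>
  <a href="/page/3">3</a>
  <a href="/page/2">Next</a>
</div>
"""

SECOND_PAGE = """
<div class="pager">
  <a href="/page/1">Previous</a>
  <a href="/page/1">1</a>
  <a href="/page/3">3</a>
  <a href="/page/3">Next</a>
</div>
"""


def link_text(pager_xml, xpath):
    """Return the text of the first element matched by xpath, or None."""
    match = ET.fromstring(pager_xml).find(xpath)
    return None if match is None else match.text


BY_POSITION = "./a[3]"         # "the third link", whatever it happens to be
BY_TEXT = "./a[.='Next']"      # "the link whose text is exactly Next"

for name, pager in (("first", FIRST_PAGE), ("second", SECOND_PAGE)):
    print(name, link_text(pager, BY_POSITION), link_text(pager, BY_TEXT))
```

On the second fragment the position-based expression lands on the "3" link, while the text-based expression still finds "Next". This is the kind of situation where a hand-written XPath tends to be more reliable.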
Where can I modify XPath in Octoparse?
To modify XPath in Octoparse:
- Select the data field that needs to be modified, then select "Customize data field".
- Select "Customize XPath".
- Enter the new XPath in the "Matching XPath" textbox.
For steps such as "Loop Item" (used for pagination or for switching drop-down options), you can find the XPath textbox under "Advanced Options". Enter the new XPath and click "OK" to save your changes.
How to write XPath?
If you are new to XPath, you may need to learn some HTML basics first. XPath locates elements based on their tags and attributes, so before you write your own XPath, inspect the HTML structure of the page.
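If you prefer to look at the structure in code, the sketch below walks an HTML fragment with Python's standard-library html.parser and prints one XPath-style path per element, built from the tags and attributes it meets. The class OutlineParser, the VOID_TAGS set and the sample markup are all hypothetical, and the generated paths are only a starting point for writing your own expression.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class OutlineParser(HTMLParser):
    """Collect an XPath-style path for every start tag in a fragment."""

    def __init__(self):
        super().__init__()
        self._stack = []   # (tag, step) for each currently open element
        self.paths = []

    def handle_starttag(self, tag, attrs):
        # Turn each attribute into a predicate such as [@class='item'].
        predicates = "".join(
            f"[@{name}='{value}']" for name, value in attrs if value is not None
        )
        step = tag + predicates
        parents = [parent_step for _, parent_step in self._stack]
        self.paths.append("//" + "/".join(parents + [step]))
        if tag not in VOID_TAGS:
            self._stack.append((tag, step))

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the innermost open element.
        if self._stack and self._stack[-1][0] == tag:
            self._stack.pop()


SAMPLE = """
<ul id="results">
  <li class="item"><a href="/one">One</a></li>
  <li class="item"><img src="one.png"><span>1</span></li>
</ul>
"""

parser = OutlineParser()
parser.feed(SAMPLE)
for path in parser.paths:
    print(path)
```

Each printed line shows which tags enclose an element and which attributes could be used in a predicate.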
We suggest using the Firebug plugin for Firefox. Firebug is very useful for looking up elements in an HTML document.
(Firebug is now only available for old versions of Firefox. Get the old versions of Firefox here.)
Open a webpage in Firefox, click the Firebug button, and click an element on the page to inspect it. Firebug will show the XPath of that element.
Octoparse also provides its own help with XPath generation: the XPath tool. Use it to generate a working XPath expression by setting the appropriate criteria. You can find the XPath tool in the "Tools" box.
Common XPath expressions used in Octoparse
In this tutorial, we will go through some basic and common XPath expressions used in Octoparse.
Expression | Meaning
. | Selects the current node
//* | Selects all elements in the document
.// | Selects descendants of the current node at any depth
@ | Selects attributes
.//div | Selects all <div> elements one or more levels deep in the current context
//li[a] | Selects the <li> elements that have an <a> child element
//li[a or h2] | Selects the <li> elements that have either an <a> or an <h2> child element
.//div[@class='publish-time'] | Selects only the <div> elements whose class attribute is exactly "publish-time"
.//*[text()='Next'] | Selects all elements that have a text node equal to "Next"
//a[contains(text(), 'Next')] | Selects the <a> elements whose text contains "Next"
.//*[contains(@class, 'name')] | Selects all elements whose class attribute contains the string "name"
following-sibling | Selects all siblings after the current node
//h1/following-sibling::p[1] | Selects the first <p> sibling that follows an <h1>
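Several of these expressions can be tried out with a short Python sketch. It assumes a made-up XML fragment and uses the standard-library xml.etree.ElementTree, which supports only part of XPath: contains(), text() and the following-sibling axis are not available there, so the last two checks are emulated in plain Python (the helper first_following_sibling is hypothetical). A full XPath 1.0 engine, such as the one in a browser, evaluates the expressions from the table directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, written as well-formed XML.
DOC = """
<body>
  <h1>News</h1>
  <p>Intro paragraph</p>
  <ul>
    <li><a href="/one">One</a></li>
    <li>No link here</li>
  </ul>
  <div class="publish-time">2021-11-24</div>
  <div class="user-name">alice</div>
  <a href="/page/2">Next</a>
</body>
"""

root = ET.fromstring(DOC)

# .  -> the current node itself
current = root.find(".")

# .//div  -> every <div> below the current node
divs = root.findall(".//div")

# .//li[a]  -> <li> elements that have an <a> child
li_with_link = root.findall(".//li[a]")

# .//div[@class='publish-time']  -> exact match on the class attribute
publish_time = root.findall(".//div[@class='publish-time']")

# .//*[text()='Next'] is written as .//*[.='Next'] in ElementTree (3.7+)
exact_next = root.findall(".//*[.='Next']")

# .//*[contains(@class, 'name')] emulated with a plain substring test
class_has_name = [el for el in root.iter() if "name" in el.get("class", "")]


def first_following_sibling(parent, start_tag, target_tag):
    """Emulate start_tag/following-sibling::target_tag[1] among parent's children."""
    seen_start = False
    for child in parent:
        if seen_start and child.tag == target_tag:
            return child
        if child.tag == start_tag:
            seen_start = True
    return None


# //h1/following-sibling::p[1]  -> first <p> sibling after the <h1>
first_p = first_following_sibling(root, "h1", "p")

print(len(divs), len(li_with_link), publish_time[0].text)
print(exact_next[0].get("href"), class_has_name[0].text, first_p.text)
```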
XPath is very powerful and this tutorial is just an introduction to the basic concepts.
If you want to learn more about it, check out these resources:
https://www.w3schools.com/xml/xpath_intro.asp
https://msdn.microsoft.com/en-us/library/ms256086(v=vs.110).aspx
https://en.wikipedia.org/wiki/XPath