Locate elements with XPath

Sunday, April 08, 2018 9:20 AM

What is XPath? How does it work in Octoparse?

XPath is a query language for locating specific elements on a page. In Octoparse, an XPath that you modify by hand is often more flexible and accurate than the one auto-generated when you click elements during task configuration.

Octoparse lets you modify the XPath so that you can precisely locate the data you want to scrape. If you would like to learn more about XPath, here is a tutorial for reference: https://www.w3schools.com/xml/xpath_intro.asp
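
As a rough illustration outside Octoparse, the sketch below uses Python's built-in xml.etree.ElementTree module, which supports only a limited subset of XPath, to locate one element in a small page. The page content and class names are invented for this example.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed page invented for this example.
PAGE = """
<html>
  <body>
    <div class="title">Octoparse XPath basics</div>
    <div class="publish-time">2018-04-08</div>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# Locate the <div> whose class attribute is exactly "publish-time".
node = root.find(".//div[@class='publish-time']")
publish_time = node.text
print(publish_time)
```

The same expression, .//div[@class='publish-time'], is the kind of XPath you would enter in Octoparse to target that field.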

 

When should I use XPath?

In most cases, you don't need to write XPath yourself. In some situations, though, you may need to modify it to locate the data on the webpage more reliably.

(This is one of our advanced tutorials. Before using XPath, we suggest you get more familiar with the basics of Octoparse.)

  • Data in irregular locations
  • Extra or missing data
  • Pagination without a "Next" button
  • A "Next" button that cannot be located precisely
  • Drop-down menus without a switch loop
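
Two of these situations can be sketched with Python's built-in xml.etree.ElementTree module: an overly broad XPath that picks up extra data, and a "Next" link that is easier to find by its text than by its position. The page below is made up for illustration. Note that [.='Next'] is ElementTree's own way of matching an element's text (Python 3.7+), standing in for the text()='Next' form used in full XPath.

```python
import xml.etree.ElementTree as ET

# Made-up listing page: an ad, two items, and two pager links.
PAGE = """
<html><body>
  <div class="ad">Sponsored</div>
  <div class="item">Item A</div>
  <div class="item">Item B</div>
  <a href="/page/1">Previous</a>
  <a href="/page/3">Next</a>
</body></html>
"""

root = ET.fromstring(PAGE.strip())

# Too broad: matches every <div>, including the ad (extra data).
broad = [d.text for d in root.findall(".//div")]

# Narrower: only the <div> elements whose class is "item".
precise = [d.text for d in root.findall(".//div[@class='item']")]

# Pick the pager link by its text instead of its position on the page.
next_link = root.find(".//a[.='Next']")
next_href = next_link.get("href")

print(broad, precise, next_href)
```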

 

Where can I modify XPath in Octoparse?

To modify XPath in Octoparse:

Select the data field that needs to be modified, then select "Customize data field".

Select "Customize XPath":

 

Enter the new XPath in the "Matching XPath" textbox.

 

For steps such as "Loop Item", used for pagination or for switching a drop-down menu, you can find the XPath textbox under "Advanced Options". Enter the new XPath and click "OK" to save your changes.

 

How do I write XPath?

If you are new to XPath, you may need to grasp some HTML basics first. XPath locates elements based on their tags and attributes, so before writing your own XPath, inspect the HTML structure of the page.
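
To make "inspecting the HTML structure" concrete, here is a minimal sketch that prints each element's tag and attributes as an indented outline. It uses Python's built-in xml.etree.ElementTree module; the page and the outline helper are hypothetical and only work on well-formed markup.

```python
import xml.etree.ElementTree as ET

# Made-up page used only to show the tag/attribute outline.
PAGE = """
<html><body>
  <ul id="results">
    <li class="row"><a href="/a">First</a></li>
    <li class="row"><a href="/b">Second</a></li>
  </ul>
</body></html>
"""

def outline(element, depth=0):
    """Return one line per element: indented tag name plus its attributes."""
    lines = ["  " * depth + element.tag + " " + str(element.attrib)]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(PAGE.strip())
structure = outline(root)
print("\n".join(structure))
```

Reading the outline, the tag names (ul, li, a) and attributes (id, class, href) are the pieces an XPath such as .//li[@class='row'] is built from.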

(More tutorials about HTML)

 

We suggest using Firebug (a Firefox plugin). Firebug is very useful for looking up elements in an HTML document.

(Firebug is now only available for old versions of Firefox. Get the old versions of Firefox here.)

 

Open a webpage in Firefox, click the Firebug button, and then click an element on the page to inspect it. Firebug will show the XPath for that element.

 

Octoparse also provides extra help with XPath generation: the XPath tool. Use it to generate a working XPath expression by setting up the appropriate criteria. You can find the XPath tool in the "Tools" box.
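
The idea of composing an XPath from criteria can be sketched in a few lines of Python. The build_xpath function below is purely hypothetical and is not how the Octoparse XPath tool is implemented; it only shows how a tag name, an attribute, and a text condition combine into one expression.

```python
def build_xpath(tag="*", attribute=None, value=None, text=None, exact=True):
    """Compose an XPath string from simple criteria (hypothetical helper).

    Values containing a single quote are not escaped in this sketch.
    """
    predicates = []
    if attribute and value is not None:
        if exact:
            predicates.append(f"@{attribute}='{value}'")
        else:
            predicates.append(f"contains(@{attribute}, '{value}')")
    if text is not None:
        if exact:
            predicates.append(f"text()='{text}'")
        else:
            predicates.append(f"contains(text(), '{text}')")
    # Each criterion becomes its own [...] predicate after the tag name.
    suffix = "".join(f"[{p}]" for p in predicates)
    return f"//{tag}{suffix}"

print(build_xpath("div", "class", "publish-time"))
print(build_xpath("a", text="Next", exact=False))
```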

 

Common XPath expressions used in Octoparse

In this tutorial, we will go through some basic XPath expressions that are commonly used in Octoparse.

Expression                        Meaning

.                                 Selects the current node
//*                               Selects all elements in the document
.//                               Selects descendants of the current node, at any depth
@                                 Selects attributes
.//div                            Selects all <div> elements one or more levels deep in the current context
//li[a]                           Selects the <li> elements that enclose an <a> element
//li[a or h2]                     Selects the <li> elements that enclose either an <a> or an <h2> element
.//div[@class='publish-time']     Selects only the <div> elements whose class attribute is "publish-time"
.//*[text()='Next']               Selects all elements whose text is exactly "Next"
//a[contains(text(), 'Next')]     Selects the <a> elements whose text contains "Next"
.//*[contains(@class, 'name')]    Selects all elements whose class attribute contains the string "name"
following-sibling                 Selects all siblings after the current node
//h1/following-sibling::p[1]      Selects the first <p> sibling after an <h1>
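
A few of the expressions above can be tried outside Octoparse with Python's built-in xml.etree.ElementTree module. This is only a sketch on an invented page: ElementTree implements a limited XPath subset, so the forms using "or", contains(), text(), and following-sibling are not available there, and [.='Next'] is used as its closest equivalent to text()='Next'.

```python
import xml.etree.ElementTree as ET

# Invented page for trying out the expressions.
PAGE = """
<html><body>
  <div class="publish-time">9:20 AM</div>
  <div class="content">
    <ul>
      <li><a href="/1">One</a></li>
      <li>No link here</li>
    </ul>
  </div>
  <a href="/page/2">Next</a>
</body></html>
"""

root = ET.fromstring(PAGE.strip())

# .//div : every <div> below the current node, at any depth.
div_count = len(root.findall(".//div"))

# .//li[a] : only the <li> elements that enclose an <a> child.
li_with_link = [li.find("a").text for li in root.findall(".//li[a]")]

# .//div[@class='publish-time'] : match on an exact attribute value.
publish_time = root.find(".//div[@class='publish-time']").text

# ElementTree has no text() function; [.='Next'] matches on element text.
next_tags = [el.tag for el in root.findall(".//*[.='Next']")]

print(div_count, li_with_link, publish_time, next_tags)
```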


 

XPath is very powerful and this tutorial is just an introduction to the basic concepts.

 

 

If you want to learn more about it, check out these resources:

https://www.w3schools.com/xml/xpath_intro.asp

https://msdn.microsoft.com/en-us/library/ms256086(v=vs.110).aspx

https://en.wikipedia.org/wiki/XPath

 
