Getting started with XPath 1
Wednesday, April 20, 2016 11:02 AM
Before reading this article, we strongly recommend reading these two
articles for a better understanding of HTML and XPath.
Sometimes users need to use XPath expressions to locate any type of information on a webpage.
Octoparse provides an XPath engine for HTML documents so that we can precisely locate the data on a webpage. Here, we will introduce some useful tools we need to get our hands on. They are extensions to the Firefox browser called ‘FireBug’ and ‘FirePath’.
As one of the most popular Firefox add-ons, FireBug lets you look up the HTML/CSS of any element on a webpage, which makes it very easy to debug and develop webpages. FirePath is a FireBug extension that adds a development tool to edit, inspect and generate XPath expressions, CSS 3 selectors and JQuery selectors, with auto-completion for XPath. Users who don’t know much about XPath will benefit a lot from FirePath.
In this tutorial we will learn how to install FireBug and FirePath and how to use these two tools to edit and generate XPath expressions.
Installation of FireBug and FirePath
The installation process takes three steps, as follows.
Launch the Mozilla Firefox browser and open the ‘Open menu’ to select the Add-ons section.
In the Add-ons Manager page, enter FireBug in the search bar in the top right corner of the browser. Then hit the Install button.
Enter FirePath in the search bar and install it.
Go back to the start page and navigate to the example link. Once the browser has loaded the page, click on the FireBug icon in the top right corner. The FireBug window should appear as below:
For this example link we are going to keep things very simple. We will find the HTML code on this page that corresponds to a given web element.
Here, we first click on the ‘Inspect’ button in FireBug and then hover the cursor over the content of the web page. You will see blue borders appear around elements as you move the cursor. When you click ￥41908, you can see the following appear in FireBug:
FireBug and FirePath have found the HTML code for the content ‘￥41908’ on the page. Right click the corresponding HTML code and select the option ‘Copy XPath’. The XPath expression is “.//*[@id='gdp']”.
Use XPath expression in Octoparse
You can use this XPath expression in Octoparse as well. Let’s understand the process step by step as follows.
Launch Octoparse and build a new task in Advanced Mode. Drag an ‘Open page’ button to the workflow designer.
Open “http://192.168.0.4/xpath.html” and click ‘Save’.
After the web page is loaded, click the ‘7.2%’ in the web page and select the ‘Extract text’ option.
In the Define Fields table, click the field we just extracted, click on the ‘Customize Field’ button, and select the second option, ‘Define ways to locate an item’.
In the ‘Matching XPath’ bar, paste the XPath expression “.//*[@id='gdp']” we just copied from FireBug. Then click OK. (You can edit the XPath expressions by using our ‘XPath Tools’.)
We will see that the extracted content has changed to ‘￥41908’.
We can edit XPath expressions by ourselves! Let’s take a look at the HTML code.
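The XPath expression for ‘￥41908’ can be written as .//*[@id='gdp']. Why?
Let’s look at the meaning of the operators in sequence:
. selects the current node.
// selects nodes at any depth below it.
* matches any element, whatever its tag name.
[@id='gdp'] keeps only the elements whose “id” attribute has the value “gdp”.
So this XPath expression selects every element below the current node (here, the whole document) that has an “id” attribute with a value of “gdp”. Easy to understand, right?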
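To see how the pieces of .//*[@id='gdp'] narrow the selection, here is a minimal Python sketch using the standard library's xml.etree.ElementTree, which supports a limited subset of XPath. The sample markup is a hypothetical, well-formed stand-in for the example page: only the id values “gdp” and “rate” and the two displayed values come from this tutorial, and the surrounding tags are assumptions. It is meant as an illustration of the expression, not as how Octoparse evaluates it internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the example page. Only the ids
# "gdp" and "rate" and their text values come from the tutorial; the
# surrounding tags are assumptions made for illustration.
SAMPLE_PAGE = """
<html>
  <body>
    <div id="gdp">￥41908</div>
    <div>Growth Rate: <span id="rate">7.2%</span></div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE_PAGE)

# ".//*" on its own: every element below the context node, whatever its tag.
all_descendants = root.findall(".//*")

# Adding the predicate keeps only elements whose id attribute equals "gdp".
gdp_matches = root.findall(".//*[@id='gdp']")

print([el.tag for el in all_descendants])  # tags of all descendant elements
print([el.text for el in gdp_matches])     # text of the element(s) with id="gdp"
```

With this sample, the unfiltered expression returns all four descendant elements, while the predicate reduces the result to the single element holding ￥41908.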
We can try to write an XPath expression manually. For example, we are going to fetch “Growth Rate: 7.2%”, and the HTML code is <span id="rate">7.2%</span>.
So the XPath expression can be //span[@id='rate'] or //span (the second is looser, since it matches any span element). Let’s put these two path expressions separately into the Matching XPath bar to see if they really work. Then click the OK button.
Paste the first path expression into the Matching XPath box and click OK.
Paste the second path expression into the Matching XPath box and click OK.
Both path expressions extract the same element, “Growth Rate”. Below is the screenshot of the result.
We have learned how to edit and create XPath expressions by using some useful tools. We also provide an additional document, Getting started with XPath 2, to help you get more out of Octoparse.
If this video tutorial is not available for you, you can click here to see the corresponding graphic tutorial.
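The comparison between the two expressions can also be sketched in Python with the standard library's xml.etree.ElementTree. Two assumptions apply: the markup is a hypothetical stand-in for the example page (only the span with id “rate” is taken from this tutorial), and the expressions are written with a leading dot (.//span) because ElementTree expects paths relative to an element. The sketch also adds a second, made-up span to show why the id-based form is the more precise of the two.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the example page; only the span with
# id="rate" is taken from the tutorial.
SAMPLE_PAGE = """
<html>
  <body>
    <div id="gdp">￥41908</div>
    <div>Growth Rate: <span id="rate">7.2%</span></div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE_PAGE)

# ElementTree wants relative paths, so "//span" is written as ".//span".
by_id = root.findall(".//span[@id='rate']")
by_tag = root.findall(".//span")

# On this page there is a single span, so both expressions agree.
same_result = [el.text for el in by_id] == [el.text for el in by_tag]
print(same_result)

# Add a second (made-up) span to show how the looser expression changes.
extra = ET.SubElement(root.find("body"), "span")
extra.text = "another span"

by_id_after = root.findall(".//span[@id='rate']")
by_tag_after = root.findall(".//span")
print(len(by_id_after), len(by_tag_after))
```

As long as the page contains one span, the two expressions select the same element. Once another span appears, only the expression with the id predicate still points at the growth rate alone.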