Modify XPath Manually in Octoparse
Tuesday, October 11, 2016 8:44 AM
XPath comes up often in data extraction, yet it is not always clear when and how to use it. This post sums up the situations where you are better off writing or modifying the XPath yourself.
How to modify XPath
Before looking at when to use XPath in data extraction, you need to know how to inspect an XPath. Both Firefox and Chrome can do this. The previous post, XPath Introduction -- Use XPath to Scrape Web Data, shows how to inspect an XPath in Firefox and Chrome and copy it directly to extract the data you want. Sometimes, however, you need to write the XPath yourself.
If you don’t know how to write an XPath, our XPath Tool can help. Almost every step except “Go To Web Page” has a “Try XPath Tool” link.
You could also click the XPath Tool in the menu list. The interface is shown below.
Enter the target URL in the built-in browser to load the page's HTML. The tool can then generate an XPath and match it against the page, so you can pinpoint the page element or value you want.
Concepts in XPath Tool
Item Tag Name: The tag of the element you want to extract, such as span, a, hr or br.
Item Position: The position of the element among the matches. For example, entering “1” selects the first matching element.
Item ID/Name/Style Class: The attribute values that identify the element. HTML has other attributes as well, and you can change the attribute name. For example, in <span lang="en-US">, “lang” is also an attribute. To match on it, change the attribute name to lang and enter the value (en-US) between the quotation marks after the equals sign in the attribute text box.
Item Text: The exact text of the element you want to extract.
Item Text Contains/Start With: A fragment that the text of the element contains or starts with.
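To make the fields above more concrete, here is a minimal Python sketch of how such inputs could map onto an XPath 1.0 expression. The helper build_xpath and its parameter names are hypothetical and written only for illustration; the expressions the XPath Tool actually generates may be shaped differently.

```python
def build_xpath(tag="*", position=None, attrs=None, text=None,
                contains=None, starts_with=None):
    """Compose an XPath 1.0 expression from tool-like fields (hypothetical helper).

    Values are inserted as-is, so this sketch does not handle values that
    themselves contain a single quote.
    """
    predicates = []
    # Item ID/Name/Style Class (or any other attribute, e.g. lang)
    for name, value in (attrs or {}).items():
        predicates.append(f"@{name}='{value}'")
    # Item Text: exact text match
    if text is not None:
        predicates.append(f"text()='{text}'")
    # Item Text Contains / Start With
    if contains is not None:
        predicates.append(f"contains(text(),'{contains}')")
    if starts_with is not None:
        predicates.append(f"starts-with(text(),'{starts_with}')")

    xpath = f"//{tag}"  # Item Tag Name
    if predicates:
        xpath += "[" + " and ".join(predicates) + "]"
    if position is not None:
        xpath += f"[{position}]"  # Item Position (XPath positions start at 1)
    return xpath


print(build_xpath("span", attrs={"lang": "en-US"}))  # //span[@lang='en-US']
print(build_xpath("a", contains="Next"))             # //a[contains(text(),'Next')]
print(build_xpath("td", position=1))                 # //td[1]
```

Each field simply adds one predicate in square brackets, which is why the fields can be combined freely.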
(Note: Click HERE to learn the concepts of an XPath document.)
When to modify XPath
In the situations below, you need to write or modify the XPath yourself.
1. Extract Data in Irregular Location
Take the example below (http://www.jabong.com/clothing/). The first two items have no previous price, while the items after them do. If you want to extract the current price for every item, you need to define its location explicitly.
Inspect the XPath in Firefox. You will find that the parent element of the current price is a div tag, while the parent element of the previous price is a span tag.
You can therefore target the child element of the div tag directly. The steps below show how to write the XPath with our XPath Tool.
Click the field you want to change. ➜ Click "Customize Field". ➜ Click "Define ways to locate an item".
Enter the tag and attribute value in the text boxes, then click “Generate” and “Match”.
Copy the generated XPath into the “Relative XPath” text box, replacing the existing one, then click “OK” and save. The field now locates the exact item you want and extracts the current price.
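The idea behind this fix can be sketched in Python with the standard library's ElementTree. The markup below is invented for illustration (the real page structure and class names are not reproduced here), and ElementTree is an XML parser that supports only a subset of XPath, so a real page would need an HTML parser. The point is only that anchoring the path on the div parent picks the current price even when a previous price comes first.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: structure and class names are assumptions for this sketch.
HTML = """
<ul>
  <li>
    <div class="now"><span class="amount">499</span></div>
  </li>
  <li>
    <span class="was"><span class="amount">999</span></span>
    <div class="now"><span class="amount">699</span></div>
  </li>
</ul>
"""

root = ET.fromstring(HTML)
items = root.findall("./li")

# Loose path: takes the first amount in each item, whatever its parent is.
loose = [item.find(".//span[@class='amount']").text for item in items]

# Relative path anchored on the div parent: always the current price.
anchored = [item.find("./div/span[@class='amount']").text for item in items]

print(loose)     # ['499', '999']  (second item returns the previous price)
print(anchored)  # ['499', '699']
```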
2. Extra Data or Missing Data
In the example above, the first item has no discount information while later items do, and you still want to extract the discount wherever it exists. XPath lets you do that.
Add a blank field when extracting data. ➜ Choose the field and click "Customize Field" to locate the item. ➜ Choose "Define ways to locate an item" and paste the XPath inspected in Firefox into the "Matching XPath" text box. ➜ Click "OK" and "Save".
PS - Sometimes a blank field will not extract any data from the site. In that case, click any piece of data on the page to create a field instead of adding a blank one.
Then choose "Customize Field" again. Choose "Define data extracted" and click "Extract Text" under the "Extract data from page content" options. ➜ Click "OK" and "Save".
Now you can check the configured rule manually and confirm that the discount is extracted for every item that has one.
Another way to extract this information is to derive the XPath from a field you already have. Extract the current price first, then change its XPath to match the discount. The XPath of the discount is similar to that of the current price; only the span tag differs.
Change the span tag directly and you will get the information you want.
Data can go missing for several reasons. Some loop items simply do not contain the information, like the discount above. The AJAX timeout may not be set properly. Or the data you want may not be at the expected location. In that last case, you can change the XPath in the same way to locate the item.
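A small Python sketch shows why a path evaluated relative to each loop item copes with missing data. The markup and class names are hypothetical, and ElementTree stands in for whatever engine evaluates the XPath. An item without a discount yields an empty value instead of shifting the remaining rows.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing: only the second item carries a discount.
HTML = """
<ul>
  <li><span class="price">499</span></li>
  <li><span class="price">699</span><span class="discount">30% off</span></li>
</ul>
"""

root = ET.fromstring(HTML)

rows = []
for item in root.findall("./li"):
    # Both paths are evaluated relative to the current loop item.
    price = item.findtext("./span[@class='price']")
    # default="" keeps the row aligned when the item has no discount.
    discount = item.findtext("./span[@class='discount']", default="")
    rows.append((price, discount))

print(rows)  # [('499', ''), ('699', '30% off')]

# A page-wide search finds one discount for two items and cannot tell
# which item it belongs to.
print(len(root.findall(".//span[@class='discount']")))  # 1
```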
3. Pagination without “Next” Button
For websites that paginate through query strings and have no “Next” button, a page navigation action cannot be added with simple clicks. You need to define the item manually with an XPath.
Let’s take asta.org as an example.
(URL of the example: http://web.asta.org/iMIS/ASTA/Directory?navItemNumber=11304)
After you click the Find button, you land on the first result page.
The current page (here the 1st page) is always a <span> tag, while the other pages are <a> tags. See the GIF file below.
The loop item you need is the page right after the current page. In an XPath expression, “following-sibling::” selects all sibling elements after the current node, and “..” goes back to the parent element of the current node. Here you can use “following-sibling::” to write the XPath that locates the next page.
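The Python sketch below illustrates what an expression of the shape span/../following-sibling::td[1] selects. The pager markup is a simplified assumption, not the actual asta.org HTML. Python's built-in ElementTree does not support the following-sibling axis, so the hypothetical helper next_page_cell emulates it by walking the cells of the row; a full XPath 1.0 engine would evaluate the expression directly.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical pager: the current page (2) is a span, the rest are links.
PAGER = """
<table><tr>
  <td><a href="?page=1">1</a></td>
  <td><span>2</span></td>
  <td><a href="?page=3">3</a></td>
  <td><a href="?page=4">4</a></td>
</tr></table>
"""


def next_page_cell(row):
    """Emulate span/../following-sibling::td[1] for one pager row."""
    cells = list(row)
    for index, cell in enumerate(cells):
        # The cell holding a <span> is the parent ("..") of the current page.
        if cell.tag == "td" and cell.find("./span") is not None:
            # following-sibling::td -> td cells after it; [1] -> the nearest one.
            following = [c for c in cells[index + 1:] if c.tag == "td"]
            return following[0] if following else None
    return None


row = ET.fromstring(PAGER).find("./tr")
cell = next_page_cell(row)
link = cell.find("./a")
print(link.text, link.get("href"))  # 3 ?page=3
```

On the last page there is no following cell, so the helper returns None, which is the natural point for a pagination loop to stop.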
If you don’t know how to write the XPath, you could use our XPath Tool.
First, copy the XPath of the current page (the 1st page) from Firebug. The XPath is:
.//*[@id='ctl01_TemplateBody_WebPartManager1_gwpciNewBPDirectorySearchCommon_ciNewBPDirectorySearchCommon_gvResults']/tbody/tr[1]/td/table/tbody/tr/td[1]/span
Second, open our XPath Tool and paste the XPath of the current page into the “XPath” text box.
Third, click "Match".
Then click "Parent". ➜ Click "Generate". ➜ Click "Match" to move back to the parent element of the current node.
Since you only need the page right after the current one, you have to define its location. In this example you need the XPath of the second page, but at this point the XPath still points to the first page.
In our XPath Tool, clicking the "Next" button generates "following-sibling::".
So click "Next". ➜ Enter “td” in the "Item Tag Name" text box. ➜ Choose “Item Position” and enter “1”. ➜ Click "Generate". ➜ Click "Match" to get the final XPath.
Now go back to Octoparse. In the Loop Item, paste the XPath you got into the "Single element" text box under "Advanced Options". This gives you the loop item you want.
4. Drop-down Menu without Switch Loop
For drop-down menu values that cannot be extracted by selecting “Loop switch combobox” directly, you need to define the values manually with an XPath.
Let’s take eBay as an example (URL of the example: http://www.ebay.co.uk/motors).
Copy the XPath inspected in Firefox and paste it into the "Variable list" text box under "Advanced Options" of "Loop Item". You now have the loop item you want.
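As a rough illustration of what such an XPath selects, the Python sketch below lists the options of a drop-down menu. The form markup, the id "make" and the option values are all assumptions made for this example; they are not taken from the eBay page.

```python
import xml.etree.ElementTree as ET

# Hypothetical drop-down; the id and the option values are invented.
FORM = """
<form>
  <select id="make">
    <option value="">Any make</option>
    <option value="audi">Audi</option>
    <option value="bmw">BMW</option>
  </select>
</form>
"""

root = ET.fromstring(FORM)

# One XPath selects every option of the menu, giving one loop entry per option.
options = root.findall(".//select[@id='make']/option")

# Skip the placeholder entry, which has an empty value.
variables = [(o.get("value"), o.text) for o in options if o.get("value")]

print(variables)  # [('audi', 'Audi'), ('bmw', 'BMW')]
```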
5. Extract Particular Values
If you want to extract particular values, you can also use an XPath to locate exactly the data you want. The steps are similar to those above.
Author: The Octoparse Team