How to associate data with nearby text?

Thursday, August 16, 2018

Octoparse locates data with XPath, but data can change position within a web page. To handle this, we will show you how to extract data more reliably by associating it with nearby text.

First, let’s look at an example of when this technique can be useful.  

[Image: example product details list showing "Product Dimensions" and "Item Weight" with their values]

In the example image above, the value for "Product Dimensions" is located next to the words "Product Dimensions". Similarly, the value for "Item Weight" will always be found next to the words "Item Weight". The same pattern applies to the rest of the list.

While "Product Dimensions" might move from the first row to the second row of the list, its associated value will still be found next to it. A more consistent way to capture the value of any such element is therefore to first look for where the words are, then locate the data next to them. In this example, instead of trying to find the value "13.4 x 0.3 x 13.4 inches" directly on the page, we can capture it more accurately by relating it to the text "Product Dimensions".
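
To make the reasoning concrete, here is a minimal Python sketch that is independent of Octoparse. The rows, the helper names `value_by_position` and `value_by_label`, and the "Item Weight" value are all made up for illustration. It compares a lookup by row position with a lookup by label after the rows change places.

```python
# Hypothetical rows as (label, value) pairs, in page order.
# The "Item Weight" value is invented for this illustration.
page_before = [
    ("Product Dimensions", "13.4 x 0.3 x 13.4 inches"),
    ("Item Weight", "2 pounds"),
]

# The same list after the site swaps its rows.
page_after = list(reversed(page_before))


def value_by_position(rows, index):
    """Position-based lookup: returns whatever value sits in that row."""
    return rows[index][1]


def value_by_label(rows, label):
    """Label-based lookup: find the row whose label contains the text."""
    for row_label, value in rows:
        if label in row_label:
            return value
    return None  # label not present on the page


print(value_by_position(page_before, 0))  # 13.4 x 0.3 x 13.4 inches
print(value_by_position(page_after, 0))   # 2 pounds (wrong field)
print(value_by_label(page_after, "Product Dimensions"))  # 13.4 x 0.3 x 13.4 inches
```

The position-based lookup silently returns the wrong field once the rows move, while the label-based lookup keeps returning the dimensions.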

Follow the steps below to see how it is done:

1) Click on "13.4 x 0.3 x 13.4 inches" to capture the value for "Product Dimensions". Once it is extracted, select the data field, then click the icon for customizing the field.

2) Click "Customize XPATH"

3) Find the relative XPath relating to the text of the target data field 
  • Now, load the page in Firefox and inspect the target data field with FirePath. Notice that the words "Product Dimensions" are found within the <th> tag, while the associated value is found within the <td> tag that immediately follows it.

  • Once we see the pattern, we can write a relative XPath that finds the value of "Product Dimensions" relative to the words themselves: ".//th[contains(text(), 'Product Dimensions')]/following-sibling::td[1]". This expression tells the program to look for the <th> tag whose text contains "Product Dimensions", then take the first <td> sibling that follows it. That gives exactly what we want: the associated value of "Product Dimensions".

  • Enter the new XPath in the "Matching XPath" text box, then click "OK" to save the settings.
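
For readers who want to see what the expression in step 3 does outside of Octoparse, the sketch below walks through the same logic in Python. It is an illustration under assumptions: the HTML snippet and the function name `first_td_after_th` are hypothetical, and because the standard-library ElementTree module does not support the following-sibling axis, the axis is emulated with a plain loop rather than evaluated as XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet modelled on the product details list.
# The "Item Weight" value is invented for this illustration.
HTML = """
<table>
  <tr><th> Item Weight </th><td> 2 pounds </td></tr>
  <tr><th> Product Dimensions </th><td> 13.4 x 0.3 x 13.4 inches </td></tr>
</table>
"""


def first_td_after_th(root, label):
    """Emulate .//th[contains(text(), label)]/following-sibling::td[1]."""
    for parent in root.iter():
        children = list(parent)
        for i, child in enumerate(children):
            # th[contains(text(), label)]: a <th> whose text contains the label
            if child.tag == "th" and label in (child.text or ""):
                # following-sibling::td[1]: first <td> after it, same parent
                for sibling in children[i + 1:]:
                    if sibling.tag == "td":
                        return sibling
    return None  # no matching label/value pair found


root = ET.fromstring(HTML.strip())
cell = first_td_after_th(root, "Product Dimensions")
print(cell.text.strip())  # 13.4 x 0.3 x 13.4 inches
```

With a library that implements full XPath 1.0, the expression from step 3 could be evaluated directly instead of being emulated.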

Now, Octoparse will always look for the associated value of "Product Dimensions" according to where the words "Product Dimensions" appear on the web page. Applying this technique to similar fields on the list can help reduce the chance of "element not found" errors.
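
The same idea extends to several fields at once. The following sketch again uses only the Python standard library with a hypothetical snippet. The helper names `label_value_map` and `extract_fields` and the label "Shipping Weight" are assumptions made for illustration. It pairs every label with the value cell that follows it and returns None for a label that is missing, rather than failing. Unlike the contains() test in the XPath above, this version matches labels exactly after trimming whitespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet; the "Item Weight" value is invented.
HTML = """
<table>
  <tr><th> Item Weight </th><td> 2 pounds </td></tr>
  <tr><th> Product Dimensions </th><td> 13.4 x 0.3 x 13.4 inches </td></tr>
</table>
"""


def label_value_map(root):
    """Map each <th> label to the text of the first <td> after it in its row."""
    result = {}
    for row in root.iter("tr"):
        label = None
        for cell in row:
            if cell.tag == "th":
                label = (cell.text or "").strip()
            elif cell.tag == "td" and label is not None:
                # Keep the first value seen for a label, then reset.
                result.setdefault(label, (cell.text or "").strip())
                label = None
    return result


def extract_fields(root, wanted):
    """Look up each wanted label; missing labels map to None."""
    found = label_value_map(root)
    return {name: found.get(name) for name in wanted}


root = ET.fromstring(HTML.strip())
fields = extract_fields(
    root, ["Product Dimensions", "Item Weight", "Shipping Weight"]
)
for name, value in fields.items():
    print(name, "->", value)
```

Here "Shipping Weight" is not on the page, so it comes back as None while the other two fields are still captured regardless of their row order.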

Tips!

  • An absolute XPath is a direct path to an element on a web page. Its disadvantage is that if the nesting of the elements changes, the XPath will fail to locate the target element.
  • A relative XPath lets you search for elements on the page using tags, attributes, and values. By adding these criteria, you have a greater chance of locating the element accurately.
  • The following-sibling axis selects elements that come after a designated element under the same parent, so it is very often used for finding an element located next to another one.
  • To learn more about XPath, see the "Getting started with XPath" articles.

 
