How to exclude "Ads" items when creating a list?

Wednesday, November 24, 2021

The latest version of this tutorial is available here. Check it out now!


When you create a list of items to scrape from a website, the list may sometimes include several “Ads” items (Example URL).


What should you do if you only want to scrape the non-ads items?

You just need to modify the XPath of the “Loop Item” so that it locates only the non-ads items.


If you check the source code of the items in the example above with Firebug (a Firefox extension), you will see the difference between the ads items and the non-ads items.



The class attribute is clearly different, so we can use this difference to write an XPath that matches only the non-ads items: //li[@class='regular-search-result']
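To see how this kind of XPath filters a list outside of Octoparse, here is a minimal Python sketch. The markup and the ad class name ("ad-search-result") are made up for illustration and are not taken from the example page; only the 'regular-search-result' class comes from the XPath above. It uses the XPath subset built into Python's standard library, which requires a leading "." on the path.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup. The "ad-search-result" class and the
# listing names are assumptions for illustration only.
SAMPLE = """
<ul>
  <li class="ad-search-result">Sponsored listing</li>
  <li class="regular-search-result">Listing A</li>
  <li class="regular-search-result">Listing B</li>
</ul>
""".strip()

root = ET.fromstring(SAMPLE)

# Same predicate as the XPath above, written as a relative path because
# ElementTree only supports a subset of XPath.
items = root.findall(".//li[@class='regular-search-result']")

names = [li.text for li in items]
print(names)  # ['Listing A', 'Listing B']
```

Note that @class='regular-search-result' is an exact match on the whole attribute value. If the non-ads items on a page carry more than one class, a full XPath engine would likely need something like contains(@class, 'regular-search-result') instead.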


Enter the XPath into Octoparse, and you will see that the Ads are excluded.



If you are new to XPath, you may need to learn some basics of HTML and XPath first. Here are some tutorials for your reference: HTML basic | XPath basic

