[Octoparse User Review] By Fabrice Siebert – Basic Plan User
Web crawling without any tools: a nightmare!
I have been crawling and parsing websites for a while, using PHP and cURL. Year after year, it became clear that the extraction routines running on my server were getting harder and harder to keep in good working order. Websites regularly change minor things on their pages; in the best case, you no longer get some or all of the expected data, and in the worst case, you get completely inaccurate data.
Then came, for me (and, I must admit, my limited skills), THE hammer: AJAX! Yes, HTML + JavaScript + CSS + DOM… And the dynamic pages that don’t load everything at first, that wait for you to click a button, that only show content as you scroll down, that swap static picture URLs for pictures loaded dynamically with JavaScript… In two words: a nightmare!
So, I had to find a way to keep extracting the data I needed, without having to earn an engineering degree in information technology… It had to be fast, and it had to be robust!
Why did I choose Octoparse?
I tried several scraping tools, and my final choice was Octoparse, for several reasons:
- Easy to set up.
- Lots of tutorials to get started easily.
- Ajax is handled as easily as a basic HTML URL… as if there were no Ajax routines on the pages at all. That is really what made me give it a try… because I was unable to access the most important part of the data I needed… hidden behind a ‘Display’ Ajax button that I wasn’t able to deal with (with PHP / cURL).
- 10 tasks are offered for free. And as far as I know, they won’t be public tasks, as is the case with some of Octoparse’s competitors.
- Smart Mode and Wizard Mode make it easy to find the data, often at the first attempt. Sometimes you need to look for alternatives… but Octoparse tries to do it for you.
Advanced Mode: the most important part!
But of course, the Advanced Mode is the most important part. You don’t need to start with it: start with Smart Mode or Wizard Mode, then edit in Advanced Mode… and extract exactly what you need.
I’ve been using a kind of XPath for years with PHP… but here, it’s easy and clear. You can even save a data extraction configuration file, to be reused in a new project or elsewhere.
The only drawback I have noticed is that, when Wizard Mode is used, Octoparse mostly generates child/child/child XPath expressions, which seem to me less robust than locating elements by specific attributes such as class, ID, or others. But of course, you can make the XPath more robust by editing it in Advanced Mode.
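To illustrate why a positional path seems less robust to me, here is a minimal Python sketch. It is not what Octoparse generates: the two page snippets, the two paths and the `extract` helper are all made up for the example, and the snippets are assumed to be well-formed so that the standard library parser accepts them (real pages usually need a more tolerant HTML parser).

```python
import xml.etree.ElementTree as ET

# Hypothetical page, before and after a "minor" change:
# in V2 the site adds a promo banner above the price.
PAGE_V1 = (
    "<html><body>"
    "<div><span class='price'>19.99</span></div>"
    "<div><span class='stock'>In stock</span></div>"
    "</body></html>"
)
PAGE_V2 = (
    "<html><body>"
    "<div><span class='promo'>-10% today</span></div>"
    "<div><span class='price'>19.99</span></div>"
    "<div><span class='stock'>In stock</span></div>"
    "</body></html>"
)

# Child/child/child style: "first div of body, then its span".
POSITIONAL = "./body/div[1]/span"
# Attribute style: "any span whose class is 'price'".
BY_ATTRIBUTE = ".//span[@class='price']"


def extract(page, path):
    """Return the text of the first element matching path, or None."""
    root = ET.fromstring(page)
    node = root.find(path)
    return node.text if node is not None else None


for name, page in (("V1", PAGE_V1), ("V2", PAGE_V2)):
    print(name, "positional:", extract(page, POSITIONAL))
    print(name, "by attribute:", extract(page, BY_ATTRIBUTE))
```

On the first version both paths return the price. On the second version the positional path quietly returns the promo text instead (the "absolutely inaccurate data" case), while the attribute-based path still finds the price.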
Formatting the data before exporting it is now easy, and helps to shrink the volume of data.
I haven’t been using Octoparse for long, but it should definitely help me save a lot of time… and money (provided I manage to set up the APIs… 😉)