Web scraping tools (also called data extraction tools or web scrapers) help you collect data from websites and store it in a local database or spreadsheet. There are many web scraping tools on the market, so before choosing one for your business, it's important to know what each tool provides. Below is a comprehensive comparison chart for the top 5 web scraping tools: Octoparse, Parsehub, Mozenda, Dexi.io, and Import.io.
Overview
Here is a brief overview of these 5 web scraping tools.
| Characteristics | Octoparse | Parsehub | Mozenda | Dexi.io | Import.io |
| --- | --- | --- | --- | --- | --- |
| Usability | ★★★★★ | ★★★★☆ | ★★★★★ | ★★★★★ | ★★★★★ |
| Functionality | ★★★★☆ | ★★★★☆ | ★★★★☆ | ★★★★☆ | ★★★★☆ |
| Easy to learn | ★★★★★ | ★★★★☆ | ★★★★★ | ★★★★☆ | ★★★★★ |
| Customer support | Email, phone, training, community support | Email, live chat, forum | Phone, email, video chat | Email, phone, community support | Email, training, chatbot, community support |
| Price | $19-Basic, $89-Standard, $249-Professional | $149-Standard, $499-Professional | Starting from $100 per 5000 pages | $119-Standard, $399-Professional, $699-Corporate | $299-Essential, $4999-Enterprise annual, $9999-Premium annual |
| Trial version/Free version | Free version | Free version | 30-day trial | Trial | 7-day trial |
| OS (Specifications) | Win | Win, Mac, Linux | Win | Win, Mac, Linux | Win, Mac, Linux |
| Data export formats | TXT, CSV, Excel, databases (MySQL, SQL Server, Oracle) | CSV, JSON | CSV, TSV, XML, Excel, JSON | CSV, Excel, XML, JSON, Zip | CSV, JSON, Google Sheets |
| Multi-thread | Yes | Yes | Yes | Yes | No |
| API | Yes | Yes | Yes (specific API) | Yes | Yes |
| Scheduling | Yes | Yes | Yes | Yes | Yes |
Build a crawler
A crawler typically scrapes data from a single website, with a limited or unlimited number of page/URL queries. Here I list the most important features for building one.
| Feature | Octoparse | Parsehub | Mozenda | Dexi.io | Import.io |
| --- | --- | --- | --- | --- | --- |
| Built-in browser | Yes | Yes | Yes | No | Yes |
| Keyboard shortcuts | No | Yes | No | Yes | Yes |
| Pagination | | | | | |
| Next button | Yes | Yes | Yes | Yes | Yes |
| Load more | Yes | Yes | Yes | Yes | Yes |
| Numbers | Yes | Yes | Yes | Yes | Yes |
| Infinite scrolling | Yes, with a customizable number of scrolls | Yes | Yes | Yes | Yes |
| Enter text | | | | | |
| Various keywords | Yes | Yes | Yes | Yes | Yes |
| Combine keywords from two lists | No | No | No | No | |
| Date inputs | No | Yes | No | No | |
| “In tandem” loop | No | No | No | No | |
| Enter a list of URLs/keywords | | | | | |
| A list of URLs | Yes | Yes | Yes | Yes | Yes |
| JSON | No | No | No | No | |
| Update URL lists | No | No | No | Yes | |
| Input document | No | Yes, supports CSV input | Yes, supports CSV input | No | |
| Select data | | | | | |
| Selecting elements | Point-and-click, XPath | Point-and-click, XPath, CSS | Point-and-click, XPath | Point-and-click, CSS | Point-and-click, XPath |
| Data formats | Text, HTML, URL, etc. | Text, HTML, URL, etc. | Text, HTML, URL, etc. | Text, HTML, URL, etc. | Text, HTML, URL, etc. |
| Transforming data | Yes, via regular expressions | Yes, via regular expressions | Yes, via regular expressions | Yes, via regular expressions | Yes, via regular expressions |
| Use scraped data as an input in one project | No | Yes, scrape data from one website and use it as input for scraping another | No | No | No |
| Customized serial number | No | Yes, by adding a number variable that increments on each iteration | No | No | No |
| Different kinds of date strings | Yes | Yes | No | Yes | |
| Crawlers/Tasks switch | Yes, supports multi-thread operation | Yes, supports switching to another crawler while configuring one | No | Yes | No |
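To make a few of the input features in the table above more concrete, here is a minimal Python sketch of what they amount to. It is not how any of the five tools implements them. The function names and sample keywords are hypothetical, and reading the “in tandem” loop as position-by-position pairing is an assumption.

```python
from itertools import product


def combine_keywords(first, second):
    """Pair every item of `first` with every item of `second`."""
    return [f"{a} {b}" for a, b in product(first, second)]


def tandem_keywords(first, second):
    """Pair items by position: first with first, second with second."""
    return [f"{a} {b}" for a, b in zip(first, second)]


def add_serial_numbers(rows, start=1):
    """Attach a counter that increments on each iteration."""
    return [{"no": n, **row} for n, row in enumerate(rows, start=start)]


if __name__ == "__main__":
    products = ["laptop", "phone"]  # hypothetical search terms
    cities = ["Boston", "Denver"]   # hypothetical search terms
    print(combine_keywords(products, cities))
    print(tandem_keywords(products, cities))
    queries = [{"query": q} for q in tandem_keywords(products, cities)]
    print(add_serial_numbers(queries))
```

Combining two lists of n and m keywords produces n × m search terms, while the tandem pairing produces only as many terms as the shorter list holds, so the choice directly affects how many pages a crawler has to visit.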
Extract data
Some advanced features needed when extracting data:
| Feature | Octoparse | Parsehub | Mozenda | Dexi.io | Import.io |
| --- | --- | --- | --- | --- | --- |
| Scraping mode | Local running and cloud running on Octoparse servers | Cloud running | Cloud running | Cloud running | Cloud running |
| Visual mode | Yes | No | Yes | No | Yes |
| Test run | No | Yes, up to 5 pages | Yes | Yes | No |
| Extract behind a login | Yes | Yes | Yes | Yes | Yes |
| Scheduling | Yes, supports scheduling tasks in real time/daily/weekly/monthly | Yes | Yes | Yes, supports choosing the local time zone | Yes |
| IP rotation | Yes, supports IP rotation and choosing a different geo-location before running a task | Yes | Yes, supports IP rotation and choosing a different geo-location before running a task | Yes | Yes |
| Solving CAPTCHA | Yes, only available for local running | Yes, only available for text-input CAPTCHAs | No | Yes, requires integration with a third-party CAPTCHA-solving platform | No |
| Error report/debug | Yes, missing-data error report | No | Yes, error troubleshooting reminder | Yes, provides screenshots, error messages, debug mode, and the execution log | No |
Get data
Features on how to get data:
| Feature | Octoparse | Parsehub | Mozenda | Dexi.io | Import.io |
| --- | --- | --- | --- | --- | --- |
| Extraction speed or cloud server distribution | | | | | |
| Free plan | No cloud servers, depends on the local network | 1 worker (approx. 5 pages/minute) | – | – | – |
| Standard plan | 6 cloud workers, depending on the rule of the crawlers | 4 workers (approx. 20 pages/minute) | Depending on paid page credits | 1 worker | Depending on the number of URLs extracted |
| Professional plan | 20 cloud servers, depending on the rule of the crawlers | 24 workers (approx. 120 pages/minute) | Depending on paid page credits | 3 workers | Depending on the number of URLs extracted |
| Concurrent running crawlers | | | | | |
| Free plan | 2 for local running | 1 | – | – | – |
| Standard plan | Unlimited for local running, 6 for cloud running | 4 | 2 | 1 | Depending on the number of URLs extracted |
| Professional plan | Unlimited for local running, 20 for cloud running | 24 | 5 | 3 | Depending on the number of URLs extracted |
| Customized servers | No | Yes, servers can be distributed manually | No | Yes | No |
| Local running | Yes | Only available with Test Run | Only available with Test Run | No | No |
| Cloud running | Yes, cloud extraction differs by paid plan, based on the cloud servers | Yes, cloud extraction speed differs by paid plan | Yes, cloud extraction speed differs by paid page credits | Yes, depending on the workers or robots | Yes |
| Notification on task completion | No | Yes, email | Yes, email | Yes, message | Yes, email |
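The worker figures in the table above (roughly 5 pages per minute per worker in the Parsehub column) can be turned into a rough run-time estimate. The sketch below assumes throughput scales linearly with the number of workers, which is only an approximation read off those figures. The function name and the default rate are illustrative, not part of any tool's API.

```python
def estimate_minutes(pages, workers, pages_per_worker_per_minute=5.0):
    """Rough run time in minutes, assuming linear scaling across workers."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    if pages_per_worker_per_minute <= 0:
        raise ValueError("pages_per_worker_per_minute must be positive")
    return pages / (workers * pages_per_worker_per_minute)


if __name__ == "__main__":
    # Compare the 1-, 4- and 24-worker tiers for a 10,000-page job.
    for workers in (1, 4, 24):
        minutes = estimate_minutes(10_000, workers)
        print(f"{workers:>2} workers: about {minutes:.0f} minutes")
```

Real extraction speed also depends on the target site, page weight, and any rate limiting, so treat the result as an order-of-magnitude guide when comparing plans.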
Data export and data storage
| Feature | Octoparse | Parsehub | Mozenda | Dexi.io | Import.io |
| --- | --- | --- | --- | --- | --- |
| Data export | | | | | |
| API | Yes | Yes | Yes | Yes | Yes |
| CSV | Yes | Yes | Yes | Yes | Yes |
| JSON | No | Yes | Yes | Yes | No |
| Google Sheets | No | Yes, with API | Yes | Yes | Yes |
| Tableau | No | Yes, integrated with Tableau | No | No | Yes |
| Web | No | Yes | No | Yes | No |
| Data storage | | | | | |
| Free | No, data must be exported to your own machine | 14 days | – | – | – |
| Standard | 3 months | 14 days | 1 GB storage | No | No |
| Professional | 3 months | 30 days | 5 GB storage | No | No |
| Enterprise | – | 30 days | 50 GB storage | No | No |
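When a tool offers only one of the formats you need, the exported records can usually be converted locally. As a minimal sketch using only the Python standard library, the helpers below serialize the same list of records to CSV and to JSON. The function names and sample records are hypothetical, and the code assumes every record has the same keys.

```python
import csv
import io
import json


def to_csv(rows):
    """Serialize a list of dicts to CSV text, using the first row's keys as the header."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """Serialize the same records to indented JSON text."""
    return json.dumps(rows, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    records = [  # hypothetical scraped records
        {"name": "Widget", "price": "9.99"},
        {"name": "Gadget", "price": "19.50"},
    ]
    print(to_csv(records))
    print(to_json(records))
```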
Solutions
Web scraping tools are used to scrape many different kinds of websites. Here I list some typical websites that most people care about.
| Website | Octoparse | Parsehub | Mozenda | Dexi.io | Import.io |
| --- | --- | --- | --- | --- | --- |
| Job | | | | | |
| LinkedIn | No, easily detected and banned by LinkedIn's anti-scraping techniques | No | No | No | No |
| Glassdoor | Yes | Yes | Yes | Yes | Yes |
| SNS | | | | | |
| | Yes | No | No | No | No |
| | Yes | Yes | Yes | Yes | Yes |
| | Yes | Yes | Yes | Yes | Yes |
| Real estate | | | | | |
| Airbnb | No, the updated website is not compatible with Octoparse's built-in browser | No | No | No | No |
| Booking | Yes | Yes | Yes | Yes | Yes |
| Realtor.com | Yes | Yes | Yes | Yes | Yes |
| Tripadvisor | Yes | Yes | Yes | Yes | Yes |
| Product details | | | | | |
| Yellowpages | Yes | Yes | Yes | Yes | Yes |
| Yelp | Yes | Yes | Yes | Yes | Yes |
| Amazon | Yes | Yes | Yes | Yes | Yes |
| eBay | Yes | Yes | Yes | Yes | Yes |
| Maps | | | | | |
| Google Maps (latitude and longitude data) | Yes | No | No | Yes | |
| Others | | | | | |
| | Yes | No | No | No | No |
Premium Plans and support
| Feature | Octoparse | Parsehub | Mozenda | Dexi.io | Import.io |
| --- | --- | --- | --- | --- | --- |
| All paid plans | | | | | |
| Download images and files to Dropbox | No | Yes, integrates with Dropbox | Yes, integrates with Dropbox | Yes, integrates with Dropbox | No |
| Download images and files to Amazon S3 | No | Yes, integrates with Amazon S3 | Yes, integrates with Amazon S3 | Yes, integrates with Amazon S3 | No |
| Outer proxy | Yes, also available for free plans | Yes | No | Yes | No |
| IP rotation | Yes | Yes | Yes | Yes | Yes |
| Crawlers/tasks | | | | | |
| Free plan | 10 crawlers | 5 public projects | – | – | – |
| Standard plan | 100 crawlers | 20 private projects | 1 user, 10 agents | 1 worker | Depending on the number of URLs extracted |
| Professional plan | 250 crawlers | 120 private projects | 2 users, 50 agents | 3 workers | Depending on the number of URLs extracted |
| Enterprise/custom plan | – | Custom | 3 users, unlimited agents | Custom | Depending on the number of URLs extracted |
| Enterprise | | | | | |
| OCR (Optical Character Recognition) | No | Yes, scrapes text out of images | Yes, scrapes text out of documents | No | Yes |
| URL queries | | | | | |
| Free plan | Unlimited | 200 | – | – | – |
| Standard | Unlimited | 10,000 | 5,000 (up to 25,000) | Unlimited URLs with limited scraping time | 5,000 |
| Professional | Unlimited | Unlimited | 25,000 (up to 125,000) | Unlimited URLs with limited scraping time | 250,000 |
| Enterprise | – | Unlimited | Starting from 100,000 | Unlimited URLs with limited scraping time | 1,000,000 |
| Support | | | | | |
| Response time | Within 1 day | 1 day | 1 day | 1 day | 1 day |
| Support system | No | Intercom | Ticket | No | Intercom |




