Web scraping tools (also called data extraction tools or web scrapers) help you collect data from websites and store it in a local database or spreadsheet. There are many web scraping tools on the market, so before choosing the right one for your business, it’s important to know what each tool provides. Below is a comprehensive comparison of the top 5 web scraping tools – Octoparse, Parsehub, Mozenda, Dexi.io, and Import.io.
Overview
Here is a brief introduction to these 5 web scraping tools.
| Characteristics | Octoparse | Parsehub | Mozenda | Dexi.io | Import.io |
| --- | --- | --- | --- | --- | --- |
| Usability | ★★★★★ | ★★★★☆ | ★★★★★ | ★★★★★ | ★★★★★ |
| Functionality | ★★★★☆ | ★★★★☆ | ★★★★☆ | ★★★★☆ | ★★★★☆ |
| Easy to learn | ★★★★★ | ★★★★☆ | ★★★★★ | ★★★★☆ | ★★★★★ |
| Customer support | Email, phone, training, community support | Email, live chat, forum | Phone, email, video chat | Email, phone, community support | Email, training, chatbot, community support |
| Price | $19 Basic, $89 Standard, $249 Professional | $149 Standard, $499 Professional | Starting from $100 per 5,000 pages | $119 Standard, $399 Professional, $699 Corporate | $299 Essential, $4,999 Enterprise (annual), $9,999 Premium (annual) |
| Trial/free version | Free version | Free version | 30-day trial | Trial | 7-day trial |
| OS | Windows | Windows, Mac, Linux | Windows | Windows, Mac, Linux | Windows, Mac, Linux |
| Data export formats | TXT, CSV, Excel, databases (MySQL, SQL Server, Oracle) | CSV, JSON | CSV, TSV, XML, Excel, JSON | CSV, Excel, XML, JSON, Zip | CSV, JSON, Google Sheets |
| Multi-thread | Yes | Yes | Yes | Yes | No |
| API | Yes | Yes | Yes (specific API) | Yes | Yes |
| Scheduling | Yes | Yes | Yes | Yes | Yes |
Build a crawler
A crawler typically scrapes data from a single website, with either limited or unlimited page/URL queries. Below are the most important features to consider when scraping data online.
| Content | Octoparse | Parsehub | Mozenda | Dexi.io | Import.io |
| --- | --- | --- | --- | --- | --- |
| Built-in browser | Yes | Yes | Yes | No | Yes |
| Keyboard shortcuts | No | Yes | No | Yes | Yes |
| **Pagination** | | | | | |
| Next button | Yes | Yes | Yes | Yes | Yes |
| Load more | Yes | Yes | Yes | Yes | Yes |
| Numbers | Yes | Yes | Yes | Yes | Yes |
| Infinite scrolling | Yes, with a customizable number of scrolls | Yes | Yes | Yes | Yes |
| **Enter text** | | | | | |
| Various keywords | Yes | Yes | Yes | Yes | Yes |
| Combine keywords from two lists | No | No | No | No | – |
| Date inputs | No | Yes | No | No | – |
| "In tandem" loop | No | No | No | No | – |
| **Enter a list of URLs/keywords** | | | | | |
| A list of URLs | Yes | Yes | Yes | Yes | Yes |
| JSON | No | No | No | No | – |
| Update URL lists | No | No | No | Yes | – |
| Input document | No | Yes, CSV input | Yes, CSV input | No | – |
| **Select data** | | | | | |
| Selecting elements | Point-and-click, XPath | Point-and-click, XPath, CSS | Point-and-click, XPath | Point-and-click, CSS | Point-and-click, XPath |
| Data formats | Text, HTML, URL, etc. | Text, HTML, URL, etc. | Text, HTML, URL, etc. | Text, HTML, URL, etc. | Text, HTML, URL, etc. |
| Transforming data | Yes, via regular expressions | Yes, via regular expressions | Yes, via regular expressions | Yes, via regular expressions | Yes, via regular expressions |
| Use scraped data as an input in one project | No | Yes, scrape data from one website and use it as input for scraping another | No | No | No |
| Customized serial number | No | Yes, a number variable that increments on each iteration | No | No | No |
| Different kinds of date strings | Yes | Yes | No | Yes | – |
| Crawler/task switching | Yes, supports multi-threaded operation | Yes, supports switching to another crawler while configuring one | No | Yes | No |
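The "Select data" rows above distinguish point-and-click, XPath, and CSS selection, plus regex-based transformation. To make these ideas concrete, here is a minimal sketch of XPath-style selection and a regex transform using only Python's standard library; the HTML snippet and field names are invented for the example (the `xml.etree.ElementTree` module only supports a limited XPath subset):

```python
import re
import xml.etree.ElementTree as ET

# A tiny, well-formed product listing standing in for a scraped page.
PAGE = """
<html><body>
  <div class="product"><span class="name">Widget A</span><span class="price">Price: $19.99</span></div>
  <div class="product"><span class="name">Widget B</span><span class="price">Price: $5.50</span></div>
</body></html>
"""

def extract_products(page: str) -> list:
    """Select elements with a limited XPath and clean values with a regex."""
    root = ET.fromstring(page.strip())
    products = []
    for div in root.findall(".//div[@class='product']"):
        name = div.find("span[@class='name']").text
        raw_price = div.find("span[@class='price']").text
        # Regex transform: strip the "Price: $" label, keep only the number.
        match = re.search(r"\$([\d.]+)", raw_price)
        products.append({"name": name, "price": float(match.group(1))})
    return products

print(extract_products(PAGE))
```

A point-and-click scraper generates selectors like `.//div[@class='product']` for you; writing them by hand, as above, is what the XPath/CSS columns refer to.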
Extract data
Some advanced features you may need when extracting data:
| Content | Octoparse | Parsehub | Mozenda | Dexi.io | Import.io |
| --- | --- | --- | --- | --- | --- |
| Scraping mode | Local and cloud runs (on Octoparse servers) | Cloud | Cloud | Cloud | Cloud |
| Visual mode | Yes | No | Yes | No | Yes |
| Test run | No | Yes, up to 5 pages | Yes | Yes | No |
| Extract behind a login | Yes | Yes | Yes | Yes | Yes |
| Scheduling | Yes, real-time/daily/weekly/monthly tasks | Yes | Yes | Yes, with local time-zone support | Yes |
| IP rotation | Yes, with a choice of geo-location before running a task | Yes | Yes, with a choice of geo-location before running a task | Yes | Yes |
| Solving CAPTCHAs | Yes, local runs only | Yes, text-input CAPTCHAs only | No | Yes, via a third-party CAPTCHA-solving platform | No |
| Error report/debug | Yes, missing-data error report | No | Yes, error-troubleshooting reminder | Yes, with screenshots, error messages, debug mode, and the execution log | No |
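The IP-rotation row describes behavior all of these tools share: each outgoing request is assigned the next proxy from a pool so one address is not hammered repeatedly. A minimal sketch of that rotation logic, independent of any particular tool (the proxy addresses below are placeholders, not real endpoints):

```python
import itertools

# Hypothetical proxy pool; scraping tools manage pools like this behind the scenes.
PROXIES = [
    "http://198.51.100.1:8080",
    "http://198.51.100.2:8080",
    "http://198.51.100.3:8080",
]

proxy_cycle = itertools.cycle(PROXIES)

def next_request_config(url: str) -> dict:
    """Attach the next proxy in the rotation to an outgoing request."""
    return {"url": url, "proxy": next(proxy_cycle)}

configs = [next_request_config(f"https://example.com/page/{i}") for i in range(4)]
for cfg in configs:
    print(cfg)
```

After the pool is exhausted the cycle wraps around, so the fourth request reuses the first proxy; real services also swap in fresh addresses and geo-locations.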
Get data
Features for retrieving your extracted data:
| Content | Octoparse | Parsehub | Mozenda | Dexi.io | Import.io |
| --- | --- | --- | --- | --- | --- |
| **Extraction speed / cloud server distribution** | | | | | |
| Free plan | No cloud servers; depends on the local network | 1 worker (approx. 5 pages/minute) | – | – | – |
| Standard plan | 6 cloud workers; depends on the crawler's rules | 4 workers (approx. 20 pages/minute) | Depends on paid page credits | 1 worker | Depends on the number of URLs extracted |
| Professional plan | 20 cloud servers; depends on the crawler's rules | 24 workers (approx. 120 pages/minute) | Depends on paid page credits | 3 workers | Depends on the number of URLs extracted |
| **Concurrent running crawlers** | | | | | |
| Free plan | 2 for local runs | 1 | – | – | – |
| Standard plan | Unlimited local, 6 cloud | 4 | 2 | 1 | Depends on the number of URLs extracted |
| Professional plan | Unlimited local, 20 cloud | 24 | 5 | 3 | Depends on the number of URLs extracted |
| Customized servers | No | Yes, servers can be distributed manually | No | Yes | No |
| Local running | Yes | Test runs only | Test runs only | No | No |
| Cloud running | Yes; cloud extraction capacity varies by paid plan | Yes; cloud extraction speed varies by paid plan | Yes; cloud extraction speed varies by paid page credits | Yes; depends on the number of workers or robots | Yes |
| Notification on task completion | No | Yes, email | Yes, email | Yes, message | Yes, email |
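The worker counts above describe how many pages a tool fetches in parallel. The underlying idea can be sketched with a thread pool; `fetch` below is a stand-in for a real HTTP request, so the snippet runs without a network:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url: str) -> str:
    # Stand-in for a network fetch; a real crawler would issue an HTTP request here.
    return f"<html>content of {url}</html>"

urls = [f"https://example.com/page/{i}" for i in range(10)]

# Multi-threaded extraction: max_workers mirrors the "workers" column above.
with ThreadPoolExecutor(max_workers=4) as pool:
    pages = list(pool.map(fetch, urls))

print(len(pages))  # 10
```

More workers means more pages in flight at once, which is why the paid tiers with more workers quote higher pages-per-minute figures.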
Data export and data storage
| Content | Octoparse | Parsehub | Mozenda | Dexi.io | Import.io |
| --- | --- | --- | --- | --- | --- |
| **Data export** | | | | | |
| API | Yes | Yes | Yes | Yes | Yes |
| CSV | Yes | Yes | Yes | Yes | Yes |
| JSON | No | Yes | Yes | Yes | No |
| Google Sheets | No | Yes, via API | Yes | Yes | Yes |
| Tableau | No | Yes, Tableau integration | No | No | Yes |
| Web | No | Yes | No | Yes | No |
| **Data storage** | | | | | |
| Free | No; export the data to your own machine | 14 days | – | – | – |
| Standard | 3 months | 14 days | 1 GB | No | No |
| Professional | 3 months | 30 days | 5 GB | No | No |
| Enterprise | – | 30 days | 50 GB | No | No |
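Most of the export targets above ultimately reduce to writing scraped rows as CSV or JSON. A minimal sketch using Python's standard library (the row data is invented for the example; a string buffer stands in for a file):

```python
import csv
import io
import json

rows = [
    {"name": "Widget A", "price": 19.99},
    {"name": "Widget B", "price": 5.50},
]

# CSV export: one header row, then one line per scraped record.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "price"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()

# JSON export: the same records as a serialized list of objects.
json_text = json.dumps(rows, indent=2)

print(csv_text)
print(json_text)
```

CSV is the lowest common denominator (every tool in the table supports it), while JSON preserves types and nesting, which is why API-based exports usually prefer it.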
Solutions
Web scraping tools are used to scrape many different kinds of websites. Here are some typical sites that most users are interested in.
| Content | Octoparse | Parsehub | Mozenda | Dexi.io | Import.io |
| --- | --- | --- | --- | --- | --- |
| **Job** | | | | | |
| LinkedIn | No, easily detected and banned by LinkedIn's anti-scraping techniques | No | No | No | No |
| Glassdoor | Yes | Yes | Yes | Yes | Yes |
| **SNS** | | | | | |
| – | Yes | No | No | No | No |
| – | Yes | Yes | Yes | Yes | Yes |
| – | Yes | Yes | Yes | Yes | Yes |
| **Real estate** | | | | | |
| Airbnb | No, the updated site is not compatible with Octoparse's built-in browser | No | No | No | No |
| Booking | Yes | Yes | Yes | Yes | Yes |
| Realtor.com | Yes | Yes | Yes | Yes | Yes |
| Tripadvisor | Yes | Yes | Yes | Yes | Yes |
| **Product details** | | | | | |
| Yellowpages | Yes | Yes | Yes | Yes | Yes |
| Yelp | Yes | Yes | Yes | Yes | Yes |
| Amazon | Yes | Yes | Yes | Yes | Yes |
| eBay | Yes | Yes | Yes | Yes | Yes |
| **Maps** | | | | | |
| Google Maps (latitude and longitude data) | Yes | No | No | Yes | – |
| **Others** | | | | | |
| – | Yes | No | No | No | No |
Premium plans and support
| Content | Octoparse | Parsehub | Mozenda | Dexi.io | Import.io |
| --- | --- | --- | --- | --- | --- |
| **All paid plans** | | | | | |
| Download images and files to Dropbox | No | Yes, Dropbox integration | Yes, Dropbox integration | Yes, Dropbox integration | No |
| Download images and files to Amazon S3 | No | Yes, Amazon S3 integration | Yes, Amazon S3 integration | Yes, Amazon S3 integration | No |
| External proxies | Yes, also available on free plans | Yes | No | Yes | No |
| IP rotation | Yes | Yes | Yes | Yes | Yes |
| **Crawlers/tasks** | | | | | |
| Free plan | 10 crawlers | 5 public projects | – | – | – |
| Standard plan | 100 crawlers | 20 private projects | 1 user, 10 agents | 1 worker | Depends on the number of URLs extracted |
| Professional plan | 250 crawlers | 120 private projects | 2 users, 50 agents | 3 workers | Depends on the number of URLs extracted |
| Enterprise/custom plan | – | Custom | 3 users, unlimited agents | Custom | Depends on the number of URLs extracted |
| **Enterprise** | | | | | |
| OCR (optical character recognition) | No | Yes, extracts text from images | Yes, extracts text from documents | No | Yes |
| **URL queries** | | | | | |
| Free plan | Unlimited | 200 | – | – | – |
| Standard plan | Unlimited | 10,000 | 5,000 (up to 25,000) | Unlimited URLs with limited scraping time | 5,000 |
| Professional plan | Unlimited | Unlimited | 25,000 (up to 125,000) | Unlimited URLs with limited scraping time | 250,000 |
| Enterprise plan | – | Unlimited | Starting from 100,000 | Unlimited URLs with limited scraping time | 1,000,000 |
| **Support** | | | | | |
| Response time | Within 1 day | 1 day | 1 day | 1 day | 1 day |
| Support system | No | Intercom | Ticket | No | Intercom |