Web scraping is hard. As much as we want to call it a simple click and fetch, that is not the whole truth. Think back to the time before visual web scrapers like Octoparse, Parsehub, or Mozenda, when anyone lacking programming knowledge was held back from tech-intensive work like web scraping. Despite the time it takes to learn the software, we might come to appreciate what these “intelligent” programs offer: they have made web scraping feasible for everyone.
Why is web scraping hard?
- Coding is not for everyone
Learning to code is interesting, but only if you are interested. For those who lack the drive or time to learn, it can pose a real obstacle to getting data from the web.
- Not all websites are the same (apparently)
Sites change all the time, and maintaining scrapers can become very time-consuming and costly. While scraping ordinary HTML content may not be that hard, there is much more out there than that. What about scraping from PDFs, CSVs, or Excel files?
- Web pages are designed to interact with users in many innovative ways
Sites built on complicated JavaScript and AJAX mechanisms (which happen to be most of the popular sites you know) are tricky to scrape. Sites that require login credentials to access the data, or that change their data dynamically behind forms, can also create a serious headache for web scrapers.
- Anti-scraping mechanisms
With growing awareness of web scraping, straightforward scraping is easily detected as bot traffic and blocked. CAPTCHAs or restricted access often follow frequent visits within a short time. Tactics such as rotating user agents, changing IP addresses, and switching proxies are used to get around common anti-scraping schemes. Adding delays between page downloads or human-like navigation actions may also give the impression that “you are not a bot”.
- A “super” server is needed
Scraping a few pages and scraping at scale (like millions of pages) are totally different stories. Scraping at a large scale requires a scalable system that handles I/O, distributed crawling, communication between workers, task scheduling, duplicate checking, and more.
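To make two of the points above a little more concrete, here is a minimal sketch of what “rotating user agents”, “adding delays”, and “checking for duplication” can look like in Python. It only plans requests and sends nothing over the network. The user-agent strings, the function names, and the delay range are all assumptions made up for illustration, not taken from any particular tool.

```python
import itertools
import random
from urllib.parse import urldefrag, urlsplit, urlunsplit

# Hypothetical user-agent strings, for illustration only.
USER_AGENTS = [
    "ExampleBrowser/1.0",
    "ExampleBrowser/2.0",
    "ExampleBrowser/3.0",
]


def normalize(url):
    """Drop the fragment and lowercase scheme and host so equivalent URLs compare equal."""
    url, _fragment = urldefrag(url)
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def plan_requests(urls, min_delay=1.0, max_delay=3.0, seed=None):
    """Yield (url, headers, delay_seconds) for each unique URL, in first-seen order."""
    rng = random.Random(seed)
    agents = itertools.cycle(USER_AGENTS)  # rotate through the user agents
    seen = set()
    for url in urls:
        key = normalize(url)
        if key in seen:
            continue  # duplicate check: this page is already planned
        seen.add(key)
        headers = {"User-Agent": next(agents)}
        yield key, headers, rng.uniform(min_delay, max_delay)


if __name__ == "__main__":
    candidates = [
        "http://Example.com/a#top",
        "http://example.com/a",
        "http://example.com/b",
    ]
    for url, headers, delay in plan_requests(candidates, seed=0):
        print(f"wait {delay:.1f}s, then fetch {url} as {headers['User-Agent']}")
```

A real crawler at scale would keep the seen set in shared storage rather than in memory, which is part of why a few pages and millions of pages are such different problems.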
Learn more about what web scraping is if you are interested.
How does an “automatic” web scraper work?
Most, if not all, automatic web scrapers work by deciphering the HTML structure of the webpage. Once you “tell” the scraper what you need by dragging and clicking, the program uses various algorithms to “guess” what data you may be after, then fetches the target text, HTML, or URL from the webpage.
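As a rough illustration of what “deciphering the HTML structure” means, the sketch below uses Python's built-in html.parser module to pull out the text of every element carrying a given class. This is a simplified stand-in, not how any specific tool works internally. The sample markup, the class names, and the helper extract_by_class are all invented for the example.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element whose class attribute contains target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0  # greater than 0 while inside a matching element
        self.results = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1  # nested tag inside a match
        elif self.target_class in classes:
            self.depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.results.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self.depth:
            self._buffer.append(data)


def extract_by_class(html, target_class):
    """Hypothetical helper: return the text of all elements with the given class."""
    parser = ClassTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    return parser.results


SAMPLE = """
<ul>
  <li class="product"><span class="name">Kettle</span> <span class="price">$25</span></li>
  <li class="product"><span class="name">Toaster</span> <span class="price">$40</span></li>
</ul>
"""

if __name__ == "__main__":
    print(extract_by_class(SAMPLE, "name"))
    print(extract_by_class(SAMPLE, "price"))
```

A visual scraper does the hard part for you: when you click on one price, it works out a rule like “every element with class price” so that the same extraction applies to the rest of the page.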
Should you consider using a web scraping tool?
There’s no perfect answer to this question. However, if you find yourself in any of the situations below, you may want to check out what a scraping tool can do for you:
1) do not know how to code (and do not have the desire/time to dig deep)
2) comfortable using a computer program
3) have limited time/budget
4) looking to scrape from many websites (and the list changes)
5) want to scrape on a consistent basis
If you fit into one of the above, here are a couple of articles to help you find a scraping tool that best meets your needs.
Web scrapers are getting “smarter”
The world is progressing, and so are the different web scraping tools. Here are a few noteworthy changes recently made to the scraping tools I know. It’s great to see people so motivated to make web scraping easier and accessible to anyone.
Octoparse recently released a new beta version that introduces a brand new Template Mode for scraping with pre-built templates. Many popular sites are covered, including Amazon, Indeed, Booking, TripAdvisor, Twitter, YouTube, and more. With the new Template Mode, users are prompted to enter variables such as keyword and location, and the scraper then collects the data from the website on its own. It is a pretty neat feature if there’s a template for the site you want, and I believe the Octoparse team is constantly adding new templates too.
Also included in the beta version is a new URL feature that enables:
- Adding up to 1 million URLs to any single task/crawler (compared to 20k URLs previously)
- Batch importing URLs from local files or from another task
- Generating URLs that follow a pre-defined pattern; a straightforward example is a URL in which only the page number changes
- Associating two tasks directly: if you have a job that was split into two, one task for extracting URLs and another for extracting specific data from those URLs, you no longer have to manually “transfer” the URLs from one task to the other
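For readers who do write a little code, the last two items in the list correspond to something like the sketch below: building a list of URLs in which only the page number changes, and passing the URLs found by one step straight into the next. This is only an illustration in Python and says nothing about how Octoparse implements the feature. The URL template and both function names are hypothetical.

```python
def generate_page_urls(template, first_page, last_page):
    """Return one URL per page number, including both ends of the range.

    The template must contain a {page} placeholder.
    """
    if first_page > last_page:
        raise ValueError("first_page must not exceed last_page")
    return [template.format(page=n) for n in range(first_page, last_page + 1)]


def chain_tasks(listing_urls, extract_detail_urls):
    """Feed the URLs found by a URL-extraction step directly into the next step's input."""
    detail_urls = []
    for listing_url in listing_urls:
        detail_urls.extend(extract_detail_urls(listing_url))
    return detail_urls


if __name__ == "__main__":
    # Hypothetical search URL; only the page number changes.
    pages = generate_page_urls("https://example.com/search?q=laptop&page={page}", 1, 3)
    for url in pages:
        print(url)

    # Stand-in for the first task: pretend each listing page links to two items.
    def fake_extract(listing_url):
        return [f"{listing_url}&item={i}" for i in (1, 2)]

    print(len(chain_tasks(pages, fake_extract)))
```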
Mozenda hasn’t had an update in months, but its last update, back in December 2017, introduced a new cookie store that aims to make scraping behind a login more straightforward. Before that, there were also major feature upgrades such as in-line data compare and moving agent data. Other earlier updates, such as request blockers and the job sequencer, can also make the scraping process more efficient.
With Dexi.io, the last update, which happened more than 12 months ago, added a trigger feature that carries out actions based on whatever happens in your Dexi.io account. Although that update is over a year old, it may be worth checking out if you have a complex job.
Import.io added two new features back in July: webhooks and extractor tagging. These are not major scraping features, but they can be extremely useful if you need them. With webhooks, you can now get notified in many third-party programs such as AWS, Zapier, or Google Cloud as soon as data is extracted for a job.
Extractor tagging enables extra tagging via the API and aims to make data integration and storage easier and more efficient. Just a month earlier, Import.io had made getting foreign data much easier by offering a Country Based Extractor. You are now able to get data as if you were physically located in another country!
Examples of how web scraping is used
With new information being added to the web second by second, the possibilities are endless!
- Gather real estate listings (Zillow, Realtor.com)
- Collect lead information, such as emails and phone numbers (Yelp, Yellowpages, etc.)
- Scrape product information for competitive analysis (Amazon, eBay, etc.)
- Collect product reviews for sentiment analysis and brand management (Amazon, etc.)
- Crawl social media platforms (Facebook, Twitter, Instagram, etc.) for identifying trends and social mentions
- Collect data for various research topics
- Scrape product prices to build a pricing monitor (Amazon, eBay, etc.)
- Extract hotel data (Booking, TripAdvisor, etc.) and airline data to build aggregators
- Scrape job listings (Indeed, Glassdoor, etc.) to fuel job boards
- Scrape search results for SEO tracking
- Scrape physician data
- Scrape blogs and forums (content aggregation)
- Scrape any data for various marketing purposes
- Extract event listings
- And many more…
The next step?
Do you know how much data is being created every day? At our current pace, 2.5 quintillion bytes of data are created each day, and over 90% of all data was created in the last two years. “To scrape or not to scrape” may sooner or later become the question for many, as the volume of data increases at an unprecedented rate and data-driven decisions are valued more than ever. Technology is about making things “smarter” and easier for people, and there should be no doubt that the same will apply to web scraping.