Introducing Octoparse Version 7.1 - web scraping for dummies is official!
Sunday, November 18, 2018
Throughout its years of working in the data industry, the Octoparse team has never slowed its pace in making data more accessible to everyone. This is rooted in our belief that in the era of big data, anyone should be able to collect data and harness its power.
Yet despite the improved usability of our program, the thorough step-by-step training resources and the friendly support team we have at Octoparse, a number of people still hesitate to use it because of limited time and effort. This November, we are extremely excited to introduce Version 7.1 [download here], which includes one of our most revolutionary moves in years: Template Mode Scraping.
What makes Template Mode Scraping so special?
Have you ever wondered how much technical proficiency it takes to build a web scraper? With the newly launched Template Mode Scraping, the answer is “none”. More specifically, dozens of built-in templates now ship with the program, all ready to fetch data instantly with nearly zero learning curve!
Many popular sites such as Amazon, Indeed, Booking, TripAdvisor, Twitter, YouTube, Yellow Pages, Walmart, Realtor and many more are covered at this moment. And the best part is that if you feel a website should be added, you can simply tell us and we’ll seriously consider creating a template for it.
Who is this for?
Anyone! Yes, anyone who wants to get data quickly and easily. If we already have the template you need, that's great! If not, let us know.
Template Mode Scraping is especially valuable to anyone who needs to extract data from some of the most popular websites out there, and to those who would prefer to skip the learning and do not require a high level of data customization.
How is it different from the old Wizard Mode Scraping?
If you are not new to Octoparse, you may have already tried our old Wizard Mode scrapers. The new Template Mode Scraping and the old Wizard Mode Scraping are completely different. Wizard Mode works for a few specific page structures, while Template Mode scrapers are pre-built scrapers that extract pre-defined data fields from specific websites. In Wizard Mode, users have to correctly identify the webpage structure and tell Octoparse which data fields to capture. The Template scrapers, by contrast, take over all the heavy lifting, so all you have to do is tell Octoparse your search criteria, e.g. “iPhone”, then click “start” to get data.
How to use it?
Step 1. Select “Task Templates” from the home screen
Step 2. Pick a template
Step 3. Check the pre-defined data fields and parameters
Step 4. Select “Use Template”
Step 5. Enter values for the parameters, such as “iPhone” for the search keyword
Step 6. Save the template and run
And there are more upgrades...
In keeping with Octoparse’s commitment to large-scale scraping of even the most complex websites, the new release also includes features that make data scraping more efficient, effective and powerful.
1. Million-level URLs Input
Did you hate it when you could only input 20,000 URLs to a crawling task? We did, so we’ve changed it. Now you can add up to 1 million URLs to any task. Better yet, you can import the list of URLs from local files (txt, csv or xls) or directly from another task. You can even associate two running tasks by having one extract the URLs and the second fetch additional data from each URL extracted. In short, you can now link the two tasks directly without having to manually “transfer” the URLs from one task to the other.
Moreover, the new URL Generator feature lets you generate a list of URLs from a specific pattern. A straightforward example is a set of URLs in which only the page number changes.
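To make the pattern idea concrete, here is a minimal Python sketch of pattern-based URL generation. It is not how Octoparse implements the URL Generator; the function name `generate_urls`, the `{page}` placeholder and the example.com address are all assumptions made up for illustration.

```python
from typing import Iterator


def generate_urls(pattern: str, start: int, stop: int, step: int = 1) -> Iterator[str]:
    """Yield one URL per page number by filling the {page} placeholder.

    `pattern` is a hypothetical template such as
    "https://example.com/products?page={page}"; `stop` is inclusive.
    """
    for page in range(start, stop + 1, step):
        yield pattern.format(page=page)


# Build the URLs for pages 1 through 5 of an imaginary product listing.
urls = list(generate_urls("https://example.com/products?page={page}", 1, 5))
```

Because the function is a generator, the same approach should scale to very long lists without holding every URL in memory at once.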
Possible use cases include:
- Scraping large numbers of products from e-commerce sites. Getting product URLs and product details separately can greatly improve the efficiency and consistency of the scrapes while also reducing the chance of getting blocked or missing data.
- Scraping sites that block easily. Tasks running on a list of URLs can be assigned to run on various servers and thus better leverage IP resources to avoid getting banned.
- Scraping a large number of different pages on a particular website. Use the URL Generator to quickly generate all the page URLs and scrape all the pages simultaneously. No need to go through the pages one by one.
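The idea of spreading a URL list across several servers can be pictured with a small Python sketch. This is only an illustration under assumed names (`split_round_robin`, a worker count of 3, example.com URLs); how Octoparse actually assigns URLs to its servers is not described here.

```python
from typing import List, Sequence


def split_round_robin(urls: Sequence[str], workers: int) -> List[List[str]]:
    """Deal URLs out to `workers` batches in turn, like dealing cards."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    batches: List[List[str]] = [[] for _ in range(workers)]
    for index, url in enumerate(urls):
        batches[index % workers].append(url)
    return batches


# Seven hypothetical item pages shared among three workers.
item_urls = [f"https://example.com/item/{i}" for i in range(7)]
batches = split_round_robin(item_urls, 3)
```

Each batch could then be handed to a separate machine, so that no single IP address requests every page.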
2. Improved Dashboard
Compared to the Dashboard in version 7.0, the improved Dashboard layout is more informative, customizable and efficient.
The new version offers two dashboard layouts to choose from, based on your preference: tasks arranged by date created or by task group. You can also choose which task information appears in the dashboard, including scraping status, time used, number of runs, next run (if scheduled) and scraping completion time.
3. Upgraded Anti-blocking mechanism
- Auto switch browser (User-agent)
- Auto clear cookies
Two more anti-blocking options have been added to help reduce the chance of getting blocked by scraping-sensitive websites. In version 7.1, Octoparse can automatically switch the User-Agent (UA) and clear cookies for you.
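For readers curious about what switching the User-Agent and clearing cookies means in practice, here is a rough Python sketch using only the standard library. It does not reflect Octoparse's internals: the `RotatingSession` class and the User-Agent strings are made-up placeholders, and clearing cookies before every request is just one possible policy.

```python
import itertools
import urllib.request
from http.cookiejar import CookieJar

# Placeholder User-Agent strings, invented for this example.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) ExampleBrowser/2.0",
]


class RotatingSession:
    """Hypothetical helper that rotates the User-Agent and clears cookies."""

    def __init__(self, user_agents):
        self._agents = itertools.cycle(user_agents)
        self.cookies = CookieJar()
        self._opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(self.cookies)
        )

    def build_request(self, url):
        # Forget cookies from earlier responses, then pick the next User-Agent.
        self.cookies.clear()
        return urllib.request.Request(
            url, headers={"User-Agent": next(self._agents)}
        )

    def fetch(self, url):
        # Performs a real network request; it is defined but not called here.
        with self._opener.open(self.build_request(url), timeout=10) as response:
            return response.read()


session = RotatingSession(USER_AGENTS)
first = session.build_request("https://example.com/a")
second = session.build_request("https://example.com/b")
```

Here `first` and `second` carry different User-Agent headers, and the cookie jar is empty before each request is sent.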
Need more details? Check out the official post What's New in Octoparse 7.1.
The Next Step…
Octoparse is always working to bring you a more accessible scraping experience. There are two things we care about most: ease of use and robustness. Please tell us how you find the new features or which templates you need. We’d love to hear your feedback!
If you are interested in knowing more about web scraping and Octoparse, here are some articles you may want to check out:
- Web Scraping for non-Developers
- How to Simplify Your Approach to Web Scraping
- Web Scraping Introduction
- Extract Data from Dynamic Websites in Real Time