We’re pleased to announce the release of Octoparse version 7.1.0!
This release introduces a brand-new feature, Task Templates, which provides ready-to-use tasks for extracting data from popular websites such as Amazon, Yelp, and Tripadvisor. It also includes three major updates: to the dashboard, the URL input features, and the anti-blocking settings.
New
· Task Templates
Octoparse’s new Task Templates are designed to make web scraping easier and more accessible for anyone. With pre-built task templates, there’s no need to configure the scraping tasks. The ready-to-use task templates will shorten your learning curve and help you quickly get on board.
– How does it facilitate scraping?
With Task Templates, anyone with little or no programming knowledge can scrape the web easily. All you need to do is enter a few parameters (target page URL, search keywords, etc.), then sit back and relax!
1. Dozens of ready-to-use templates covering the most popular websites across different industries
2. Rich built-in data fields
3. Sample output preview
– How does it work?
After selecting the desired template, you will be prompted to enter the required parameters, such as the search keywords or the target URLs. The scraper then collects the data from the website on its own.
Updates
· Dashboard upgraded
Compared to the Dashboard in version 7.0, the new Dashboard layout is more informative, customizable, and efficient.
In version 7.1, you can completely change the look of your dashboard and the display order of your tasks.
1. Customizable information columns
A selection of columns is provided so you can decide which task information you’d like to see.
2. Two default view modes
By default, tasks are sorted by group on the dashboard. By switching the view mode, you can sort the tasks by last executed time in descending order.
3. Efficient custom filters
With the upgraded custom filters, it takes very little effort to build your own dashboard view, or to narrow it down to a single task or a specific cluster of tasks.
· URL input upgraded
We’ve raised the URL input limit from 20,000 to 1,000,000 and introduced two new input methods for large-scale data extraction projects.
1. Increased maximum input quota of URLs
The maximum number of URLs that can be input at once has been raised significantly. Compared to 20K URLs previously, Octoparse now supports adding up to 1 million URLs to any single task/crawler.
Tips: Please note that the maximum number of URLs for the paste-in input method is reduced to 10K.
2. Batch import URLs from files or another task
– Import URLs from files
In version 7.1, you can import a CSV, TXT, or Excel file, and Octoparse will automatically read the URL data from the file.
– Import URLs from tasks
Two options are supported. One is simple import, which imports URLs directly from a completed task. The other is advanced import, which “transfers” URLs from a parent task into a child task when the two tasks are run in association.
When two tasks are associated, Octoparse provides four execution options. For example, if you select “Run task as soon as its parent task starts”, then as soon as Octoparse reads any URL extracted in the parent task, it automatically transfers the URL into the child task and sets the child task to execute.
Tips:
1. Advanced import is only supported by Octoparse Cloud Extraction.
2. If the parent task has not extracted any data yet, you’ll need to manually paste in one URL before you can start configuring the child task.
3. Batch generate URLs based on a pre-defined pattern
This feature allows you to modify the parameter(s) you need in a given URL to generate a list of URLs that follow the same pattern.
Highlight the parameter you want to vary, click “Add parameter”, and select from the four options to define the pattern you need.
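For readers curious about what pattern-based URL generation looks like in general, here is a minimal Python sketch of the idea. It is not Octoparse’s implementation and does not reproduce its four pattern options; the `generate_urls` function, the `{page}` placeholder syntax, and the example.com URLs are all assumptions made for illustration.

```python
from itertools import product


def generate_urls(template, **params):
    """Expand a URL template into one URL per combination of parameter values.

    `template` uses str.format-style placeholders such as {page}.
    Each keyword argument supplies the values for one placeholder.
    (Hypothetical helper, not part of Octoparse.)
    """
    names = list(params)
    return [
        template.format(**dict(zip(names, combo)))
        for combo in product(*(params[name] for name in names))
    ]


# Hypothetical example: pages 1 to 3 of a made-up listing URL.
urls = generate_urls("https://example.com/list?page={page}", page=range(1, 4))
for url in urls:
    print(url)
```

Passing more than one parameter, for example a list of categories plus a page range, yields every combination of the two.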
· Anti-blocking settings upgraded
We have added two options to help reduce the chance of getting blocked by scraping-sensitive websites. In version 7.1, Octoparse can automatically switch the user agent (UA) and clear cookies for you.
1. Auto-switch browser (User agent)
2. Auto-clear cookies
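To illustrate what these two options mean conceptually, the following Python sketch rotates the User-Agent header and clears stored cookies before each request, using only the standard library. It is a rough illustration under stated assumptions, not how Octoparse works internally: the `RotatingSession` class, the user-agent strings, and the example.com URLs are all hypothetical.

```python
import itertools
import urllib.request
from http.cookiejar import CookieJar

# Made-up user-agent strings, used only for illustration.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) ExampleBrowser/2.0",
]


class RotatingSession:
    """Hypothetical helper: new user agent and empty cookie jar per request."""

    def __init__(self, user_agents):
        self._agents = itertools.cycle(user_agents)
        self.cookies = CookieJar()
        # The opener stores any cookies it receives in self.cookies.
        self.opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(self.cookies)
        )

    def build_request(self, url):
        # Drop all stored cookies, then pick the next user agent in the cycle.
        self.cookies.clear()
        return urllib.request.Request(
            url, headers={"User-Agent": next(self._agents)}
        )


session = RotatingSession(USER_AGENTS)
first = session.build_request("https://example.com/a")
second = session.build_request("https://example.com/b")
print(first.get_header("User-agent"))
print(second.get_header("User-agent"))
```

To actually send a request, you would pass it to `session.opener.open(...)`; the sketch stops short of that so it runs without network access.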