Incremental Extraction -- Get Updated Data with Clicks
Monday, September 12, 2016 9:43 PM
Websites such as news portals or forums typically have new content added frequently, sometimes dynamically. To stay up to date with such websites, Octoparse’s Incremental Extraction lets you extract updated data more efficiently by skipping pages that have already been extracted and scraping only the new ones.
When to use Incremental Extraction?
1. When you need the latest data from any website frequently
2. When the new information shows up as new web pages with new URLs (as opposed to new information being added/updated to existing web pages)
A good example is CNN.com. If you need news feeds from CNN.com in near real time, you have to schedule the task to run as frequently as needed so that whatever gets added to the site is extracted promptly. Criterion 1 is therefore met.
Each news article on CNN.com also has its own easily identifiable URL, so criterion 2 is met as well.
Assuming you have a task set up for the job, it doesn't really make sense to re-scrape those articles which have already been captured in previous runs. Using Incremental Extraction, you can easily have the URLs checked first to make sure they have not been extracted already, and only capture the ones that are truly new.
How does Incremental Extraction identify "new" data?
Incremental Extraction will only work if the newly added data can be identified with new URLs. During the extraction process, Octoparse checks each URL to determine whether it is one that has been crawled before. If a URL is identified as one from the previous crawl, it will be skipped automatically when running with Incremental Extraction.
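The skip logic described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Octoparse's actual implementation: the function name `crawl_incrementally`, the `seen` set, and the example.com URLs are all assumptions made for this example.

```python
def crawl_incrementally(urls, seen, extract):
    """Call extract(url) only for URLs that are not in `seen`.

    `seen` is a set of URLs from previous runs (hypothetical storage);
    it is updated in place. Returns the list of newly extracted URLs.
    """
    new_urls = []
    for url in urls:
        if url in seen:
            continue  # crawled in a previous run, so skip it
        extract(url)
        seen.add(url)
        new_urls.append(url)
    return new_urls


# Hypothetical URLs: the first one was captured in an earlier run.
seen_urls = {"https://example.com/news/1"}
extracted = []
new = crawl_incrementally(
    ["https://example.com/news/1", "https://example.com/news/2"],
    seen_urls,
    extracted.append,  # stand-in for the real extraction step
)
```

In this sketch only the second URL reaches the extraction step, and it is then remembered so that the next run skips it too.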
How to set up Incremental Extraction?
1. Go to task settings
2. Tick Enable incremental extraction
3. Select either Match the entire URL or Match by part of the URL
Match the entire URL
With this option, Octoparse compares the entire current URL against the URLs that have already been extracted. Even the slightest difference will have it identified as a "new" URL.
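To see why small differences matter with whole-URL matching, here is a minimal sketch that assumes the comparison is a plain string equality check. The helper name `is_new_exact` and the URLs are hypothetical; Octoparse may normalize URLs differently.

```python
def is_new_exact(url, seen):
    """Treat a URL as new unless the exact same string was seen before."""
    return url not in seen


# Hypothetical URL recorded during a previous run.
seen = {"https://example.com/item?id=42"}

same = is_new_exact("https://example.com/item?id=42", seen)
extra_param = is_new_exact("https://example.com/item?id=42&ref=home", seen)
```

Here `same` is `False`, so the page would be skipped, while `extra_param` is `True`: an extra tracking attribute is enough to make the page count as new, even though it may show the same content.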
Match by part of the URL
In many cases, URLs are composed of various attributes. An eBay search URL, for example, includes attributes such as "_from", "_trksid", "_nkw", and "_sacat" (usually anything that comes before an "=" sign).
When running with Incremental Extraction, Octoparse detects attributes automatically and makes them available as parameters. By selecting one or more attributes as parameters for the match, you tell Octoparse to compare the current URL on the selected attributes only. If they are the same as those of a URL that has already been extracted, the page is skipped; otherwise, it is scraped.
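Matching on selected attributes can be approximated with Python's standard `urllib.parse` module. This is a rough sketch under stated assumptions, not Octoparse's internal logic: the helpers `match_key` and `is_new_partial` are hypothetical, the URLs are made up, and the sketch assumes a page is skipped when all selected attributes have the same values as a previously seen URL, ignoring everything else in the URL.

```python
from urllib.parse import parse_qs, urlsplit


def match_key(url, selected):
    """Build a comparison key from the selected query attributes only."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return tuple((name, tuple(query.get(name, []))) for name in selected)


def is_new_partial(url, seen_keys, selected):
    """Return True and remember the key if no seen URL shares it."""
    key = match_key(url, selected)
    if key in seen_keys:
        return False  # same selected attributes, so skip the page
    seen_keys.add(key)
    return True


selected = ["_nkw"]  # match on the search keyword attribute only
seen_keys = set()
first = is_new_partial(
    "https://example.com/search?_nkw=laptop&_trksid=abc", seen_keys, selected)
second = is_new_partial(
    "https://example.com/search?_nkw=laptop&_trksid=xyz", seen_keys, selected)
third = is_new_partial(
    "https://example.com/search?_nkw=phone&_trksid=abc", seen_keys, selected)
```

In this example the second URL differs from the first only in "_trksid", which is not selected, so it is treated as already extracted. The third URL has a different "_nkw" value and is treated as new.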
Tip! Incremental Extraction can only be enabled for tasks with a single Extract Data action, because the feature identifies pages by the URL of the page where the Extract Data action is executed.
Happy Data Hunting!
Author: The Octoparse Team