Incremental Extraction -- Get Updated Data with Clicks

Monday, September 12, 2016 9:43 PM

  

Websites such as news portals and forums typically add new content frequently, sometimes dynamically. To stay up to date with such websites, Octoparse’s Incremental Extraction lets you extract updated data more efficiently by skipping pages that have already been extracted and scraping only the new ones.

 

When to use Incremental Extraction?

1. When you frequently need the latest data from a website

2. When new information shows up as new web pages with new URLs (as opposed to being added to, or updated on, existing web pages)

A good example is CNN.com. If you need news feeds from CNN.com in near real time, the task has to be scheduled to run as frequently as needed so that whatever gets added to the site is extracted in a timely manner. Criterion 1 is therefore met.

Each news article on CNN.com has its own URL that can be easily identified, so criterion 2 is also met.

Assuming you have a task set up for the job, it makes little sense to re-scrape articles that were already captured in previous runs. With Incremental Extraction, the URLs are checked first to make sure they have not been extracted already, and only the ones that are truly new are captured.

 

How does Incremental Extraction identify "new" data?

Incremental Extraction only works if newly added data can be identified by new URLs. During extraction, Octoparse checks each URL to determine whether it has been crawled before. If a URL is recognized as one from a previous crawl, it is skipped automatically when the task runs with Incremental Extraction.
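
The idea can be illustrated with a minimal Python sketch. This is not Octoparse's internal implementation; the function `incremental_crawl`, the `seen` set, and the `example.com` URLs are all hypothetical names chosen for illustration. It only shows the general pattern of remembering crawled URLs and skipping them on later runs.

```python
def incremental_crawl(urls, seen, extract):
    """Extract only the URLs not already in `seen` (hypothetical helper)."""
    results = []
    for url in urls:
        if url in seen:
            # Crawled in an earlier run: skip it.
            continue
        results.append(extract(url))
        seen.add(url)  # Remember it for the next run.
    return results


def fake_extract(url):
    # Stand-in for real page extraction; returns a dummy record.
    return {"url": url}


seen_urls = set()

# First run: both pages are new, so both are extracted.
first_run = incremental_crawl(
    ["https://example.com/news/1", "https://example.com/news/2"],
    seen_urls,
    fake_extract,
)

# Second run: only the page with the new URL is extracted.
second_run = incremental_crawl(
    [
        "https://example.com/news/1",
        "https://example.com/news/2",
        "https://example.com/news/3",
    ],
    seen_urls,
    fake_extract,
)
```

In this sketch the second run returns a single record, for the one URL that did not appear in the first run. Content that changes on an already-seen URL would not be picked up, which is why the feature depends on new data arriving under new URLs.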

 

How to set up Incremental Extraction?

1. Go to task settings 

2. Tick Enable incremental extraction

3. Select either "Match the entire URL" or "Match by part of the URL"

 

Match the entire URL

With this option, Octoparse compares the entire current URL against the previously extracted URLs. Even the slightest difference causes the URL to be identified as "new".
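
To see how strict whole-URL matching is, consider the small Python sketch below. It assumes that matching is a plain string comparison, which is an illustrative simplification and not a confirmed detail of Octoparse; the helper `is_new_exact` and the `example.com` URLs are hypothetical.

```python
def is_new_exact(url, previously_seen):
    """Treat a URL as new unless the exact same string was seen before."""
    return url not in previously_seen


previously_seen = {"https://example.com/item?id=42"}

# Identical string: recognized as already extracted.
same_url = "https://example.com/item?id=42"

# Same page, but an extra (hypothetical) tracking parameter changes the string.
tracked_url = "https://example.com/item?id=42&ref=home"

same_is_new = is_new_exact(same_url, previously_seen)
tracked_is_new = is_new_exact(tracked_url, previously_seen)
```

Under this assumption, `tracked_url` counts as new even though it points to the same item, so the page would be scraped again. This is the situation where matching by part of the URL is the better fit.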

 

Match by part of the URL

In many cases, URLs contain several attributes. For example, an eBay URL may include the attributes "_from", "_trksid", "_nkw", and "sacat" (an attribute name is usually whatever comes right before an "=" sign).

When running with Incremental Extraction, Octoparse detects the attributes automatically and makes them available as parameters. By selecting one or more attributes as parameters for the match, you tell Octoparse to compare the current URL with previously extracted URLs based on the selected attributes only. If the selected attributes are the same, the page is skipped; otherwise, it is scraped.
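
Partial matching can be sketched in Python with the standard `urllib.parse` module. The helpers `match_key` and `is_new_partial`, the `example.com` URLs, and the parameter values are made up for illustration; only the attribute names come from the eBay example. The sketch assumes a URL counts as already extracted when all selected attributes have the same values as in an earlier URL, which is one reading of the behavior described above and not a confirmed detail of Octoparse.

```python
from urllib.parse import parse_qs, urlsplit


def match_key(url, selected):
    """Build a comparison key from the selected query attributes only."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    # Sort the names so the key does not depend on selection order.
    return tuple((name, tuple(query.get(name, []))) for name in sorted(selected))


def is_new_partial(url, seen_keys, selected):
    """A URL is new if its selected attributes differ from every seen key."""
    return match_key(url, selected) not in seen_keys


# Compare on the search keyword only (assumed choice for this example).
selected = ["_nkw"]

old_url = "https://example.com/sch?_from=R40&_trksid=t1&_nkw=laptop&sacat=0"
seen_keys = {match_key(old_url, selected)}

# Same keyword, different tracking value: treated as already extracted.
same_search = "https://example.com/sch?_from=R40&_trksid=t2&_nkw=laptop&sacat=0"

# Different keyword: treated as new.
other_search = "https://example.com/sch?_from=R40&_trksid=t1&_nkw=phone&sacat=0"

same_search_is_new = is_new_partial(same_search, seen_keys, selected)
other_search_is_new = is_new_partial(other_search, seen_keys, selected)
```

Selecting only the attributes that actually identify a page, and leaving out tracking or session attributes, keeps incidental URL differences from triggering a re-scrape.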

 

Tip!

Incremental Extraction can only be enabled for tasks with a single Extract Data action, because the feature identifies the URL of the page where the Extract Data action is executed.

 

Happy Data Hunting!

Author: The Octoparse Team
