
Where to begin with Deduplication

De-duplication in itself is easy to understand: optimised storage capacity usage by eliminating duplicated data. However, the devil is in understanding the different technologies, techniques and implementations in the market and relating these to customers' specific needs. By David Galton-Fenzi, Zycko's Group Sales Director.

 

Date: 1 Oct 2008

Instead of storing data multiple times, de-duplication enables the data to be stored once and uses that single instance as a reference.  The techniques used to do this vary.  For instance, we could look for complete files which are the same, and create a single instance only when they match each other completely.  Alternatively, we could look at files which are basically similar (for example, revisions of a draft document) and create a single instance of a master file, saving only the byte-level differences between this and subsequent files.  So which of these approaches is best?  As always, the answer is not straightforward.

If we look at the first of these, working at a file level rather than a byte level, there are well-established techniques such as CAS (Content Addressable Storage).  With this approach the contents of the file are put through a mathematical mincer and the end product is a unique identifier which is attached to the file.  If exactly the same file exists somewhere else in the system, the mathematical mincer will produce exactly the same identifier, indicating a duplicate file which can be made into a single instance.  Using this approach, every time a spelling mistake is corrected, or punctuation is added to a document, a new identifier would be created and both versions of the document stored.  The result is that where files are constantly changing, the saving in storage capacity that can be achieved with CAS is fairly minimal.  So why do we have it at all?  The answer is: for archives.  When a file is archived it is normally for long-term storage and is likely only to be referenced rather than changed.  After all, changing the archives is like re-writing history.  This aspect of CAS is also a way to ensure that archived records are not tampered with (as might be a temptation in a company facing significant legislative challenge), as any change will produce a new identifier and will be seen as a changed file.
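The file-level idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not name the "mathematical mincer", so SHA-256 stands in for it here, and the class and file names are hypothetical rather than taken from any CAS product.

```python
import hashlib


class ContentAddressedStore:
    """Toy in-memory store keyed by the SHA-256 digest of file contents."""

    def __init__(self):
        self.blobs = {}  # digest -> contents, each distinct content stored once
        self.names = {}  # file name -> digest (a reference to the single instance)

    def put(self, name, data):
        # Assumption: SHA-256 plays the role of the "mathematical mincer".
        digest = hashlib.sha256(data).hexdigest()
        # Identical contents give the same digest, so a duplicate file
        # adds only a new reference, not a second copy.
        self.blobs.setdefault(digest, data)
        self.names[name] = digest
        return digest

    def get(self, name):
        return self.blobs[self.names[name]]

    def verify(self, name):
        """Return True if the stored contents still match their identifier."""
        digest = self.names[name]
        return hashlib.sha256(self.blobs[digest]).hexdigest() == digest


store = ContentAddressedStore()
id_a = store.put("report.doc", b"Quarterly results are up.")
id_b = store.put("copy-of-report.doc", b"Quarterly results are up.")
id_c = store.put("report-v2.doc", b"Quarterly results are up!")  # one character changed
```

In this sketch the two identical files share one stored instance, while the single punctuation change in the third file yields a different identifier and a second stored copy. The `verify` method illustrates the tamper-evidence point: altered contents no longer match the identifier they were filed under.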

This is where the second technique, byte-level deduplication, comes in.  At this level the mathematical mincer changes, and this time it is looking for differences between files at a byte level.  Going back to our previous example of a document where a spelling or punctuation change has been made, byte-level deduplication would recognise and store only the minor changes that have been made to the original document.  This is an effective approach to minimising the storage capacity consumed, but it does not provide the change tracking that CAS delivers.  Where 'live' data is being used, this approach is far and away the most effective for an enterprise environment; the challenge is that it consumes much more processing power.
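A minimal sketch of the byte-level idea, assuming a simple delta against a master copy: the revision is stored as "copy this range from the master" instructions plus only the bytes that differ. It uses Python's standard `difflib.SequenceMatcher` purely for illustration; the function names are hypothetical, and this is not a description of how any particular product computes its deltas.

```python
from difflib import SequenceMatcher


def make_delta(master, revision):
    """Describe `revision` as copies from `master` plus the changed bytes."""
    ops = []
    matcher = SequenceMatcher(None, master, revision, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))           # reuse bytes already in the master
        elif tag in ("replace", "insert"):
            ops.append(("data", revision[j1:j2]))  # store only the new bytes
        # "delete" needs no entry: those master bytes are simply not copied
    return ops


def apply_delta(master, ops):
    """Rebuild the revision from the master and its delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += master[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


def delta_size(ops):
    """Count the literal bytes the delta has to store."""
    return sum(len(op[1]) for op in ops if op[0] == "data")


master = b"The quick brown fox jumps over the lasy dog. " * 20
revision = master.replace(b"lasy", b"lazy", 1)  # one spelling correction
delta = make_delta(master, revision)
```

Here `apply_delta(master, delta)` reproduces the corrected document, and `delta_size(delta)` is far smaller than the document itself. The comparison step is also where the extra processing cost shows up: finding the matching ranges takes much more work than computing one identifier per file.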

On the face of it, this would go a long way towards saving expensive primary storage capacity.  However, the reality is that in most primary storage environments the emphasis is on performance rather than saving disk capacity, and any performance overhead (such as the mathematics to determine duplication) is seen as an inhibitor to speed of delivery.  Additionally, the lifecycle of primary data can be fleeting (minutes or even seconds), so going through deduplication may be an unnecessary process.  As a result, today, with a few evolving exceptions, byte-level deduplication is aimed at the backup environment.

Another key option to consider is where in the data centre we implement deduplication.  This doesn't sound too important, but it is a raging argument among the vendors in this part of the industry.

Some approaches have implemented deduplication for backup with a software 'agent' loaded onto each application server which undertakes backup.  This spreads the load of the deduplication processing across the processing power of all the servers involved, but crucially the agent must interact correctly and effectively with the existing backup software packages loaded onto the servers.  The upside of implementing deduplication at source is that the process is completed before any data is sent to the storage devices, minimising the data transfers between server and storage.  The downside is the one encountered by any agent-based strategy: the agent must stay compatible with the server software.  This means that any software upgrade or change on any server creates a potential for incompatibility and adds to the management task for the server administrators.
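The transfer saving from deduplicating at source can be illustrated with a small sketch. Everything here is an assumption made for the example: the fixed 4096-byte chunk size, the SHA-256 identifiers, and the hypothetical `BackupTarget` class and `source_side_backup` function. Real agents and backup packages differ in how they chunk data and talk to the storage device.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, chosen only for illustration


class BackupTarget:
    """Stands in for the storage device; keeps one copy of each chunk."""

    def __init__(self):
        self.chunks = {}         # digest -> chunk bytes
        self.bytes_received = 0  # chunk payload actually sent over the wire

    def missing(self, digests):
        return {d for d in digests if d not in self.chunks}

    def store(self, digest, chunk):
        self.chunks[digest] = chunk
        self.bytes_received += len(chunk)


def source_side_backup(data, target):
    """Agent on the server: hash locally, send only chunks the target lacks.

    Returns the ordered list of digests needed to rebuild the data.
    """
    pieces = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    recipe = [hashlib.sha256(p).hexdigest() for p in pieces]
    needed = target.missing(recipe)
    for digest, piece in zip(recipe, pieces):
        if digest in needed:
            target.store(digest, piece)
            needed.discard(digest)  # do not resend a repeat within this backup
    return recipe


def restore(recipe, target):
    return b"".join(target.chunks[d] for d in recipe)


target = BackupTarget()
monday = b"A" * 8192 + b"B" * 4096
tuesday = b"A" * 8192 + b"C" * 4096  # only the last chunk changed
recipe_mon = source_side_backup(monday, target)
sent_monday = target.bytes_received
recipe_tue = source_side_backup(tuesday, target)
sent_tuesday = target.bytes_received - sent_monday
```

In this sketch the second backup sends only the one changed chunk, because the hashing happens on the server before anything crosses the network. Moving the same hashing step into a dedicated platform in the backup path would leave the servers untouched, but the full backup stream would still have to travel as far as that platform.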

The alternative approach is to have a dedicated platform in the backup path which handles deduplication 'on the fly'.  This effectively centralises the process.  The benefits here are that the platform, not the servers, delivers the processing power for the deduplication and, because it requires no changes to the server software, it is effectively transparent to the user.  Some storage vendors are taking up the idea of embedding these functions in their storage devices, though none appear to exist yet.  In many ways this endorses the in-line platform as the most elegant solution, because these vendors are simply keeping the dedicated in-line platform but locating it inside the storage device.

Whichever approach eventually becomes the dominant implementation, as the data deluge continues to accelerate, deduplication will rapidly become a core element of any data centre's storage strategy.  It is not only the storage capacity savings that are attractive, but also the support deduplication can offer for compliance (only one instance of a file makes it easier to manage, protect and delete as required) that will continue to drive this market.


