Compulsive hoarding (or pathological collecting) is a pattern of behavior characterized by the excessive acquisition of, and an inability or unwillingness to discard, large quantities of objects that would seemingly qualify as useless or without value.
When it comes to data, it would appear that we are all hoarders.
We often read tabloid articles reporting on the likes of Mrs. Smith, who at the tender age of 75 hadn’t thrown anything away for the past 20 years. The “useful items” she’d been storing (which she insists will be put to good use at a later date) now entirely fill her home, meaning everyday tasks take umpteen times longer and important items are almost always impossible to find. Ultimately the local council or similar public body gets involved and tries to restore some measure of normality by clearing and disposing of the worthless clutter.
I often wonder whether, if we could physically see data in the same way we see the newspapers and empty tins that Mrs. Smith stores, there would be a difference in how people store and actively make use of their hoarded data. Whether people would question the wisdom of storing and systematically protecting a file, created in 2008, that will likely never be read or reviewed again.
Speaking to various organizations, one consistent pain point is evident – the backup and recovery of this accumulated data. Whether it be an ever-shrinking backup window, or the sheer scale of the content that requires protection, one thing is clear: we can’t continue down this well-trodden path. We won’t likely wake up tomorrow and decide to thin or weed out content from our file shares, nor will we decide to delete all emails older than 2009. We’re all hoarders.
There have been various advances in backup technologies aimed at reducing backup and recovery pain points - centralizing the storage and backup infrastructure, faster and larger-capacity tape devices, and tape-emulating disk solutions (virtual tape libraries or disk-to-disk-to-tape solutions). Backup and recovery software vendors also attack this problem in various ways. Incremental-forever solutions can potentially solve the problem of a shrinking backup window (only changed content is backed up) but can hugely complicate the recovery process. Traditional master and media server solutions may resolve the issue of backing up huge amounts of content, though they can prove difficult to manage and costly in terms of licensing and support. Ultimately we’re not addressing the root cause of the problem, which is the explosion in accumulated data. We don’t delete data, so the content will always require storage and protection. We’re constantly playing catch-up, deploying ever-larger (and costlier) backup and recovery solutions to solve a problem that will likely never go away.
Looking further up the stack – moving away from backup and recovery to the storage layer – there are technologies that can help. Snapshot and cloning solutions provide the ability to quickly and easily create copies of data at points in time, allowing users to recover their own data. On the flip side, snapshots are often stored on the same physical spindles as the live data, meaning a third copy (ideally on tape) should still be sought. Data deduplication and compression technologies help to thin out the data being stored, so backup windows can be shrunk, though these gains are usually undone when the data is backed up. More interestingly, a relatively old methodology is once again being used: intelligent data tiering, which helps apply a value to the data being stored, enabling the administrator to determine whether content actually requires such rigorous protection methods.
Did you know that in a “typical organization” roughly 75% of the files stored on “Tier 1” storage repositories (e.g. costly Fibre Channel or Serial Attached SCSI drives) haven’t actually been read, reviewed or used in six months? Of that 75%, a massive 65% hasn’t been accessed in over a year. Why, then, is this content treated the same as business-critical data, when it is never read and provides little or no value to the business? In fact, storing the old content and religiously backing it up actually increases operating expenditure. Conversely, data that is new or consistently accessed has a high value to the business – mission-critical content that, if lost, would result in reduced revenue or cost many man-hours to recover or recreate. And yet today, most corporates feverishly back up all the data as one, with no regard for its true value.
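As a rough illustration of how you might measure this on your own file shares, here is a minimal sketch (my own hypothetical example, not a tool from any vendor mentioned here) that walks a directory tree and buckets capacity by last-access time. One caveat: access times are only meaningful if the filesystem isn’t mounted with noatime.

```python
#!/usr/bin/env python3
"""Rough sketch: measure how much of a file tree is 'stale'.

Illustrative only; the thresholds are assumptions, and atime is
only reliable if the filesystem is not mounted with noatime.
"""
import os
import sys
import time

SIX_MONTHS = 180 * 24 * 3600  # ~6 months in seconds
ONE_YEAR = 365 * 24 * 3600

def scan(root):
    now = time.time()
    total = stale_6m = stale_1y = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # skip unreadable files
            total += st.st_size
            age = now - st.st_atime
            if age > SIX_MONTHS:
                stale_6m += st.st_size
            if age > ONE_YEAR:
                stale_1y += st.st_size
    return total, stale_6m, stale_1y

if __name__ == "__main__":
    total, m6, y1 = scan(sys.argv[1] if len(sys.argv) > 1 else ".")
    if total:
        print(f"total capacity scanned: {total / 2**30:.1f} GiB")
        print(f"not accessed in 6 months: {100 * m6 / total:.0f}%")
        print(f"not accessed in 1 year:   {100 * y1 / total:.0f}%")
```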
To start to solve the problem of backup and recovery we must first understand one important point. Not all data is equal. Some data holds more value than other data.
Hierarchical Storage Management (HSM) is nearly as old as the problem of backup and recovery itself. Originally designed for use in mainframe environments, the technology allowed for the migration of old or stale content onto tape, thus freeing up the costly hard disks to store newer and more valuable data. As the capacity of hard disks increased over time, purchase costs were driven further down, and hard disks came to be seen as commodity items – something to be discarded when newer, higher-capacity drives became available. The ability to store larger amounts of data online (as opposed to offline on tape) sounded the death knell for HSM. But alas, the trend of ever-increasing capacity and ever-reducing costs is unlikely to last forever, and once again the industry is turning to HSM as a means to reduce cost and waste.
Many storage gurus will fondly remember the concept of Information Lifecycle Management (ILM). Conceived around 2004, it promised the ability to move data between different tiers of storage easily and automatically, with little or no administrator overhead, based entirely on the value of the data. The mantra for ILM was “the right data stored in the right place at the right time”. Whilst it was a great idea in principle, no real technology or solutions existed at the time that were able to manage the transparent migration of data between the tiers. Potential customers worried (for good reason) that the recovery of migrated content would be complex, or that their users would be confused by new icons indicating that their files had been moved or migrated. The industry forgot the benefits of applying a value to data and moved on to the next big thing.
Technology, as always, has advanced and has more recently surpassed the capabilities offered by the historic HSM- and ILM-type solutions, now providing the ability to stage data onto faster media (Serial ATA hard disk drives) instead of tape devices, in a completely transparent and non-disruptive fashion. Where historically users would notice a distinct difference in speed and user experience when accessing archived content (content moved to tape by HSM, for example), the latest data tiering technologies mean that users are unaware their data is now being stored on cheaper and more appropriate storage. The migration is virtualized, and thus transparent and invisible to users and applications alike.
The newer Data Migrator technologies use stubs and pointers in much the same way as the older HSM solutions did, allowing storage administrators to migrate specific content types, dependent on policies, to several different locations – be it cheaper, more cost-effective and higher-capacity Serial ATA drives, older and less-utilized centralized storage devices (NAS), or even object stores. The migrated data can, in most instances, still benefit from local disaster recovery mechanisms such as snapshots and cloning as well. The general mechanics are sketched below.
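To make the stub-and-pointer idea concrete, here is a minimal toy sketch of the general technique – not any vendor’s actual implementation. Real tiering products intercept file access inside the filesystem or NAS head so the stub is transparent to users; this example simply shows what a stub might record and how a recall would work. The function names, stub format and policy threshold are all my own assumptions.

```python
#!/usr/bin/env python3
"""Toy sketch of policy-driven stub migration (illustrative only)."""
import json
import os
import shutil
import time

STALE_AFTER = 180 * 24 * 3600  # policy: migrate files idle ~6 months
STUB_SUFFIX = ".stub"          # assumed marker for migrated files

def migrate(path, tier2_dir):
    """Copy a cold file to the cheap tier, leave a stub behind."""
    target = os.path.join(tier2_dir, os.path.basename(path))
    shutil.copy2(path, target)           # content moves to Tier 2
    stub = {"migrated_to": target,
            "size": os.path.getsize(path),
            "migrated_at": time.time()}
    with open(path + STUB_SUFFIX, "w") as f:
        json.dump(stub, f)               # pointer stays on Tier 1
    os.remove(path)                      # reclaim the Tier 1 space

def recall(stub_path):
    """Stand-in for transparent recall: bring content back on access."""
    with open(stub_path) as f:
        stub = json.load(f)
    original = stub_path[:-len(STUB_SUFFIX)]
    shutil.copy2(stub["migrated_to"], original)
    os.remove(stub_path)
    return original

def run_policy(tier1_dir, tier2_dir):
    """Migrate every regular file that the staleness policy selects."""
    now = time.time()
    for name in os.listdir(tier1_dir):
        path = os.path.join(tier1_dir, name)
        if os.path.isfile(path) and not name.endswith(STUB_SUFFIX):
            if now - os.stat(path).st_atime > STALE_AFTER:
                migrate(path, tier2_dir)
```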
Now the stale and less valuable data has been migrated, what’s next? Well, here’s the kicker: the migrated data, deemed stale or old by the migration policy, can be removed from the backup lifecycle. Instead of a full backup consisting of 20 TB of data (remembering that typically 75% of the content stored will not have been accessed in six months), the full backup is reduced to 5 TB, leading to huge savings in time and money – not only on backup, but also on the continual accumulation of ever-larger disk capacities for Tier 1 (FC and SAS). Agreed, the migrated content would still need to be backed up separately, though given its apparent value (it hadn’t been accessed in six months) a much less frequent protection schedule could be applied. Alternatively, if the migration tier resided on an object store, you could even argue that given the self-healing, error-correction and versioning capabilities of the object store, the data would never need to be backed up again.
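A quick back-of-the-envelope check of those figures (the 20 TB total and 75% stale fraction come from the paragraph above; the effective throughput is my own assumed round number):

```python
# Back-of-the-envelope: effect of migrating stale data out of the
# full-backup set. 20 TB and 75% are the article's figures; the
# 200 MB/s effective backup throughput is an assumption.
TOTAL_TB = 20
STALE_FRACTION = 0.75
THROUGHPUT_MBS = 200  # assumed effective end-to-end throughput

def backup_hours(tb):
    return tb * 1024 * 1024 / THROUGHPUT_MBS / 3600

before = backup_hours(TOTAL_TB)                         # ~29 hours
after = backup_hours(TOTAL_TB * (1 - STALE_FRACTION))   # ~7 hours
print(f"full backup before migration: {before:.0f} h")
print(f"full backup after migration:  {after:.0f} h")
```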
So maybe I don’t have to start thinning out my file stores quite yet; the data is valuable after all? My expense report from 2008 may just become useful again. I’ll be able to search and index all my hotel bookings, and that’ll help present a case for a better corporate rate. Even better, it’ll help me keep track of all my loyalty points.
I’ll admit it. I’m a hoarder, but it’s only because I can’t throw away data. It could be useful one day.
Tags: BC/DR, Compliance, Deduplication, Disk/RAID/Tape/SSDs, Ethernet Storage, SAN/NAS, Tiered Storage