
Dealing with DNA data: doubling year on year

SNS-UK talks to Phil Butcher, Head of IT at the Wellcome Trust Sanger Institute, about the challenges of providing the computing and storage infrastructure to support the ongoing work of the Human Genome Project.

 

Date: 1 Mar 2010

SNS: Please provide a brief professional autobiography, with reference to how your job has evolved over the years

PB: I have worked in the IT industry for well over 25 years and have seen everything from IBM mainframes to DEC VAX clusters and the introduction of the first IBM PCs. I installed early ‘thick-wire’ Ethernet and designed wide area global data repositories using dial-up long before the days of broadband. My experience in IT has grown over the years through dealing with various challenges. I have been involved in major building projects, designing data centres and incorporating new network topologies, and I have built new teams to help me meet all those challenges. Today, as Head of IT at the Wellcome Trust Sanger Institute, my role involves a great deal of strategic thinking and forward planning while, at the same time, keeping an eye on technology and working with my teams to devise and provide leading edge, high performance IT for Life Sciences. One thing I have learnt is that we should always expect change and plan future strategies to be extremely flexible and adaptable.

SNS: Please give some background as to the work carried out by the Sanger Institute

PB: I joined the Wellcome Trust Sanger Institute in 1993 after working as the IT manager for a commercial software development house. I was the lone IT manager at Sanger, and the IT estate consisted of half a dozen Sparc workstations, each with a 1GB hard disk. Our main projects at that time were to uncover the DNA sequence of yeast and the nematode worm. This was prior to starting our work on the Human Genome Project. As the sequencing techniques became more production ready, the institute, with the backing of The Wellcome Trust, MRC and others, agreed to play a major role in the Human Genome Project. This was a ten year project to piece together a single human genome so that the sequence of the four base chemicals (Guanine, Cytosine, Thymine and Adenine) could be read in the correct order. This was no mean feat since there are 3 billion pairs (3,000 million) of the base chemicals in the human genome.

This would lead to a better understanding of how the human body works and give an insight into the building blocks of life itself. In collaboration with 20 or so other institutes worldwide, the Human Genome Project was completed by 2003, within the ten years. The Sanger Institute contributed around one-third of the entire genome data, completely free of charge, which was released to the public domain for researchers worldwide to work with.

SNS: Please give a flavour of the research projects carried out and their typical IT/ storage requirements

PB: Techniques in the science, analysis, data management and IT had improved so much that by 2005 Sanger and collaborators around the world had completed and published the DNA sequence for many genomes, including the mouse and human genomes and numerous pathogen genomes (disease organisms such as those causing tuberculosis and malaria), and had launched into larger scale studies of cancer, malaria etc. Our scientific data stores doubled annually from 1993 until 2005, by which time we had accrued around 300 Terabytes of data. This was just the tip of the iceberg.

Today, the technology of DNA sequencing i.e. producing detailed readouts of whole DNA genomes, has shifted into a completely new paradigm and the Sanger Institute is now producing the equivalent of several completed genomes each day. We have projects in progress to sequence 1,000 individual humans and are getting ready to sequence many thousands of individual human genomes. Within the next five years there are plans for the institute to sequence tens of thousands of genomes. This will produce a staggering amount of data.

The aim is to understand how inherited DNA, environment, diet, lifestyle etc. affect the health of individuals, leading to improved medicines and ultimately improved global health care.

Our history shows that data acquisition has doubled every year for the last 10-15 years so the future holds huge challenges for us.
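To put the doubling claim into numbers, the following is a minimal Python sketch, not anything taken from the Institute's own tooling. It assumes a clean doubling every year anchored on the roughly 300 Terabytes quoted for 2005; the function name `project_doubling` and the choice of years are purely illustrative.

```python
def project_doubling(base_tb: float, base_year: int, target_year: int) -> float:
    """Return the data volume in TB at target_year, assuming it doubles every year.

    Works backwards as well as forwards: a target year before the base
    year halves the volume once per year.
    """
    return base_tb * 2 ** (target_year - base_year)


# Anchor point taken from the interview: about 300 TB accrued by 2005.
BASE_YEAR = 2005
BASE_TB = 300.0

if __name__ == "__main__":
    # Illustrative years only; the strict doubling rate is an assumption.
    for year in (1993, 2005, 2010, 2015):
        volume_tb = project_doubling(BASE_TB, BASE_YEAR, year)
        print(f"{year}: {volume_tb:,.2f} TB")
```

Under that assumption the 2010 figure comes out at 9,600 TB, the same order of magnitude as the 7 Petabytes of raw disk mentioned later in the interview, and a further five years of doubling would multiply it by 32 again.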

SNS: Please give some idea of the IT/storage infrastructure that underpins the research projects.

PB: Our IT infrastructure has evolved but is now undergoing a radical review. We organised our compute and storage around a Fibre Channel storage area network (SAN). Although this was expensive at the time, it has paid huge dividends for flexibility, which holds true today. We also have high end, open source Lustre filesystems to handle the high I/O required from the large blade compute farms. These are based on 1Gb/sec and 10Gb/sec Ethernet networks. We have a few NFS storage arrays as well.

SNS: Are you dealing with a single site when it comes to users and IT/storage resources, or coping with a disparate user base and a grid computing infrastructure?

PB: We work from a single site, but we are replicating data to a remote colocation. We believe that the notion of onsite/offsite computing will become the norm so that we can reduce risk in a single data centre and ensure we are not limited by space and power.

SNS: Presumably storage requirement has grown into the Petabyte field – making some kind of a tiered storage approach crucial?

PB: We do have a tiered storage architecture although we don't have the optimum solutions for all the layers as yet. We are deploying a very cost effective layer using Nexsan storage as our warehouse area, where scientific project groups are encouraged to check-point their data. We faithfully replicate this to a remote site. At this time we have approximately 1 Petabyte of data replicated offsite.

SNS: Is it too simple to characterise this storage requirement as disk while the databases are running, and then straight to tape archive when the project is complete?

PB: It is evolving. At our scale I am less convinced that tape has a real future. We don't have time to back it all up and would have to wait years to restore it… Replication with snapshots for point-in-time saves seems to be adequate. We are not backing up the very large data sets to tape at all, hence the emphasis on disk replication and snapshots for point-in-time copies.
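The restore-time concern can be illustrated with a rough calculation. The Python sketch below is only an estimate under stated assumptions: the sustained rate of 120 MB/sec per tape drive and the drive counts are hypothetical figures chosen for illustration, not numbers from the interview, and the function `restore_days` ignores tape mounts, seeks and verification.

```python
SECONDS_PER_DAY = 86_400


def restore_days(data_pb: float, drive_mb_per_s: float, drives: int) -> float:
    """Estimate the days needed to stream data_pb Petabytes back from tape.

    Assumes every drive sustains drive_mb_per_s (decimal megabytes per
    second) continuously and that the load is spread evenly across drives.
    """
    total_bytes = data_pb * 1e15
    aggregate_bytes_per_s = drive_mb_per_s * 1e6 * drives
    return total_bytes / aggregate_bytes_per_s / SECONDS_PER_DAY


if __name__ == "__main__":
    ASSUMED_DRIVE_RATE = 120.0  # MB/sec per drive: an assumption, not a quoted figure
    for petabytes in (1.0, 7.0):
        for drive_count in (1, 10):
            days = restore_days(petabytes, ASSUMED_DRIVE_RATE, drive_count)
            print(f"{petabytes:.0f} PB on {drive_count} drive(s): {days:,.1f} days")
```

With these assumed figures a single drive would need more than three months of uninterrupted streaming per Petabyte, which suggests why replication and snapshots look more practical at this scale.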

SNS: Can you give some idea as to how the IT/storage infrastructure has changed over the years?

PB: Our IT has grown from half a dozen Sparc stations in 1993 with a few gigabyte disks attached to being one of the largest IT operations in Life Sciences in the world. We now have more than 8,000 cores of high performance compute in total and have more than 7 Petabytes of raw disk in the data centre today. I am already discussing adding a further 1.6PB of storage in the next three months and we are thinking about the next wave of blade computing requirements.

SNS: Also, have the research compute demands driven development of IT or IT developments allowed more sophisticated research…which drives which?

PB: The scientific data output of the new generation sequencers is increasing dramatically. The DNA sequencers are evolving at about twenty times the speed of Moore's Law. IT is seriously lagging behind. IT has been the enabler to date but science is rapidly moving more towards being in-silico i.e. computer based as opposed to in-vitro i.e. in glass or in the lab. Science is driving the IT systems development extremely hard at the moment. We are definitely pushing the envelope.

SNS: Is there no better way of moving/sharing large amounts of data across a wide geographical area than what you refer to as ‘sneakernet'?

PB: For very large datasets, the internet is not fast. Dedicated pipes can improve transfer speeds but there are costs involved. The fastest and most cost effective way to transfer very large data sets today is via DHL or Fedex. As network bandwidth increases and prices of wide area links fall we hope to be able to use networks more readily. The high energy physics people have this under their belt but have deployed numerous international lightpaths. This takes significant government investment.
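The trade-off between a courier and a network link comes down to simple arithmetic. The following Python sketch compares the two under assumptions of its own: the link speeds, the 70% link efficiency and the two-day courier time are hypothetical values, and the helper names `network_days` and `faster_option` are illustrative only.

```python
SECONDS_PER_DAY = 86_400


def network_days(data_tb: float, link_gbit_per_s: float, efficiency: float = 0.7) -> float:
    """Days needed to push data_tb Terabytes over a link of the given speed.

    efficiency is the assumed fraction of nominal bandwidth actually
    achieved end to end (protocol overhead, contention and so on).
    """
    total_bits = data_tb * 1e12 * 8
    usable_bits_per_s = link_gbit_per_s * 1e9 * efficiency
    return total_bits / usable_bits_per_s / SECONDS_PER_DAY


def faster_option(data_tb: float, link_gbit_per_s: float, courier_days: float) -> str:
    """Return 'network' or 'courier', whichever delivers the data sooner."""
    if network_days(data_tb, link_gbit_per_s) < courier_days:
        return "network"
    return "courier"


if __name__ == "__main__":
    ASSUMED_COURIER_DAYS = 2.0  # hypothetical door-to-door shipping time
    for terabytes, gbit in ((1, 1), (100, 1), (100, 10), (1000, 10)):
        days = network_days(terabytes, gbit)
        choice = faster_option(terabytes, gbit, ASSUMED_COURIER_DAYS)
        print(f"{terabytes} TB over {gbit} Gbit/sec: {days:.2f} days -> {choice}")
```

The sketch leaves out the time needed to copy data onto and off the shipped disks, which would narrow the courier's advantage somewhat.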

SNS: Does the advent of technologies such as virtualisation, deduplication and The Cloud offer any hope for the future regarding the data storage burden?

PB: We are aggressively deploying server virtualisation and reducing our power requirements and data centre footprint. We have been working closely with Amazon to explore Cloud Computing and have run complete sequencing pipelines in the Cloud environment. Work is continuing and we do see this as a good medium to distribute publicly available datasets globally provided we can get the data there in the first place.

SNS: What is the major pain point that you are looking to solve right now?

PB: Data management is the biggest concern. There are few systems available that can manage multi-Petabytes of data. We are always investigating ways to do this more efficiently.

SNS: Are there unique aspects to life sciences IT/storage infrastructure requirements, or are you able to benchmark and share knowledge with other research areas?

PB: We attend meetings to share information and learn from others, particularly in the high energy physics field. We share all of our experiences and trade information to help smaller labs avoid the pitfalls. We always welcome knowledge exchange.

SNS: What IT/storage technology not available right now, would make the biggest positive impact in helping you do your job?

PB: Very scalable cluster file systems that will be around for many years to come would help. Our informatics groups are investigating technologies such as Hadoop, which could be useful to analyse huge data sets – but not all data fits that model. We need scalable cluster file systems with HSM capability that will scale to tens of Petabytes, include good data management and replication (local and WAN), and run in a single namespace. Is this too much to ask?

SNS: Could you give some comparison as to the shrinking time windows over the years in terms of the net effect from research projects?

PB: Well, as I alluded to earlier – a single genome took almost ten years to complete. The labs are now producing many genomes' worth of sequence data each day. Data storage has increased twenty fold in the last few years and computation has increased dramatically as well. What's more significant is that the rate of change is phenomenal and is set to increase many fold within the next year or two.

SNS: Do you have any specific objectives for the coming year?

PB: We need cost effective storage and compute. The credit crunch saw a brief increase in prices but these seem to be falling again. We are virtualising where possible to minimise hardware server procurement.

SNS: Please could you provide some detail as to the relationship with S3?

PB: S3 have been intimately involved with our organisation for many years. They have advised on tape and storage strategies and have introduced vendors such as Nexsan and Isilon to us. They are very competent and have a broad range of experience.

SNS: And how does working with the Channel bring value to you as opposed to dealing direct with vendors?

PB: We also work closely with vendors, particularly if we are involved with product development, but it is always very useful to have an independent reseller like S3 to give a balanced view of products.

SNS: We've not mentioned the ‘thorny' issue of power – do you have any power/cooling issues?

PB: Yes – power is limited on the campus – hence the virtualisation program. We are installing a combined cooling, heating and power generation unit this year to add a further 1.9MW of capacity.

SNS: More specifically, is there a worry that you will run out of power at some stage?

PB: We must use the resources we have efficiently. We are fully aware of the issues of producing potentially life saving information while at the same time consuming the planet's resources. We have an aggressive program of virtualisation and consolidation to reduce IT footprint and make the best use of power where we can. We will have to move to the onsite/offsite model in the coming months.

About S3
Founded in 1988, S3 is a vendor independent, service led information management solution company. S3 solutions are designed to deliver information integration, availability, compliance and security. Its services include storage audit, information infrastructure design, solution delivery and implementation. S3's reputation in the industry is built on ‘always on’ availability backed up by commitment to customer support and problem ownership.

For more information visit www.s3.co.uk
