The emergence of genomics and advanced gene sequencing techniques has made the collection and storage of data a centerpiece of biomedical research. As the data generated in biomedical research grows steadily richer, infrastructure that handles data growth efficiently will be a cornerstone of biomedical data management. This white paper examines a joint solution that combines data reduction technologies with a network-attached storage system, offering storage optimization capabilities along with an affordable, manageable, and scalable petabyte-ready storage platform.
Coping with the Explosion of Data in Life Sciences Research
How storage is managed will either advance - or slow - the pace of biomedical discovery
www.ocarinanetworks.com
The emergence of genomics and advanced gene sequencing techniques has made the collection and storage of data a centerpiece of biomedical research. Not long ago, the Human Genome Project first sequenced the human genome as part of an international cooperative research effort. That first sequence took up about 750 gigabytes in 2000 - an amount of data that would fit on a single disk today. However, genomics research has moved rapidly past that first basic sequencing, and advances now come from increasingly sophisticated sequencing machines and technologies. Today, research institutions, universities, pharmaceutical companies, and even hospitals generate genomic data almost continually. A modern lab might generate as much as 10 terabytes of data a day.
For example, Cornell University's Computational Biology Service Unit, which supports life sciences across its many research facilities and hospitals throughout New York State, often collects as much as a terabyte a day from each of its many sources. Moving the data to tape backups is not ideal, because many researchers need immediate, fast access to a "hot copy" of the gene sequencing data they are analyzing.
"As scientific researchers acquire data at faster and faster rates, optimizing the analysis of that data with scalable storage solutions is essential," said Dr. David Lifka, Cornell Center for Advanced Computing director. "Despite advances in disk technology, storing research data remains an expensive proposition," he explained. "Ocarina provides a cost-effective way to maximize storage capacity without sacrificing performance."
This is a field where technology is advancing very quickly, and the next generation of machines from leading companies like Illumina and Affymetrix will generate even richer data - and require even more storage to hold it. Because knowledge in the field is moving so rapidly, the value of the data may not yet be fully understood, so keeping it long term for later analysis could prove very valuable for research. However, as the amount of data generated grows, the burden of capturing and storing it puts biomedical researchers at the center of a problem facing many parts of IT: coping with massive data growth.
For the most part, life sciences researchers are not storage experts, nor do they have a long history of running the world's largest data stores. They are being put in this position by the rapid increase in the amount of rich data generated by gene sequencers, ChIP sequencers, and other advanced technology. What is daunting is that the next generation of analyzers, sequencers, and other genomics technology will generate even more data.
In fact, storage is such a crucial piece of the puzzle that the pace of genomics research could well be slowed by researchers' inability to deal with the onslaught of data. This could mean a slowdown in finding cures and treatments for the world's most pressing medical problems, such as cancer, heart disease, and many other conditions. Money to purchase storage, staff to manage it, data center space to house it, and energy to power and cool it will all become significant items in overall research budgets - money that might otherwise be spent on research itself.
Content-Aware Compression and Data Deduplication for Online Storage
Backups Present Further Challenges
Another challenge of overwhelming data growth is the strain it puts on traditional backups. When data comes in at 10 terabytes or more per day, backing up to tape the old way becomes infeasible. Data reduction with content-aware compression and deduplication offers alternatives for data protection and retention. Once the primary copy of the data has been processed and reduced to one-tenth of its original size, it may make sense to create a replica of that data on another storage platform, rather than trying to back it up using legacy backup tools or tape.
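As a rough illustration of the arithmetic behind this argument, the sketch below takes the two figures quoted in the text (10 terabytes of new data per day and reduction to one-tenth of the original size) and estimates a year's storage footprint. The function name, the 365-day horizon, and the assumption that the replica holds the reduced rather than the raw data are illustrative choices, not measurements from any product.

```python
# Illustrative capacity arithmetic only. The 10 TB/day ingest rate and the
# one-tenth reduction come from the text; everything else is an assumption.

def yearly_footprint_tb(daily_ingest_tb, reduction_ratio, copies=1, days=365):
    """Return terabytes stored after `days` of steady ingest.

    reduction_ratio: original size divided by reduced size
                     (10 means the data shrinks to one-tenth).
    copies: number of stored copies (1 = primary only,
            2 = primary plus one replica of the reduced data).
    """
    if reduction_ratio <= 0:
        raise ValueError("reduction_ratio must be positive")
    raw_tb = daily_ingest_tb * days
    return raw_tb / reduction_ratio * copies


if __name__ == "__main__":
    raw = yearly_footprint_tb(10, 1)                      # no reduction
    reduced = yearly_footprint_tb(10, 10)                 # one-tenth size
    with_replica = yearly_footprint_tb(10, 10, copies=2)  # plus a replica

    print(f"Raw data after one year:       {raw:.0f} TB")
    print(f"Reduced primary copy:          {reduced:.0f} TB")
    print(f"Reduced primary plus replica:  {with_replica:.0f} TB")
```

Under these assumptions, a reduced primary copy plus a full replica together occupy one-fifth of the space the unreduced data alone would need, which is why replication can become a practical alternative to tape.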
The replica can be stored in another location, on cheaper storage than production data. This serves the purpose of protecting data and making all data accessible in the event of a data loss on primar...