Accelerate: Life Sciences: A featured DDN Success Story
IDSC Advanced Computing correlates viruses with gastrointestinal cancers for The Cancer Genome Atlas 400% faster!
The University of Miami Institute for Data Science and Computing (IDSC) maintains one of the largest centralized academic cyberinfrastructures in the country, which is integral to addressing major scientific challenges and solving many of today’s most pressing problems. Working with IDSC, more than 2,000 researchers, faculty, staff, and students across multiple disciplines collaborate on diverse, interdisciplinary projects requiring Advanced Computing resources.
• Diverse, interdisciplinary research projects required massive compute and storage power as well as integrated data lifecycle movement and management
• Highly demanding I/O and heavy interactivity requirements from next-gen sequencing intensified data generation, analysis and management
• Powerful, flexible file system was required to handle large parallel jobs and smaller, shorter serial jobs
• Data surges during analysis created “data-in-flight” challenges
An end-to-end, high performance DDN GRIDScaler® solution featuring a GS12K™ scale-out appliance with an embedded IBM® GPFS™ parallel file system
• Links between certain viruses and gastrointestinal cancers discovered with computation not possible before
• With DDN’s high performance I/O, IDSC has reduced genomics compute and analysis time from 72 to 17 hours
• The ability to meet varied research workflow demands enables IDSC to accelerate data analysis and speed scientific discoveries
• Best-in-class performance for genomics assembly, alignment and mapping has proven invaluable in supporting major medical research into Alzheimer’s, Parkinson’s and gastrointestinal cancers
• High performance storage and transparent data movement let IDSC scale storage without adding complexity
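The run times quoted above (72 hours before, 17 hours after) can be related to the headline speedup with a quick calculation. The sketch below is illustrative only: the function name `speedup` is hypothetical, and the only inputs are the two figures from the bullet list.

```python
from typing import Tuple


def speedup(before_hours: float, after_hours: float) -> Tuple[float, float]:
    """Return (speedup factor, percent reduction in run time)."""
    factor = before_hours / after_hours
    reduction_pct = (1 - after_hours / before_hours) * 100
    return factor, reduction_pct


# Figures quoted in the case study: 72 hours reduced to 17 hours.
factor, reduction_pct = speedup(72, 17)
print(f"{factor:.1f}x as fast, {reduction_pct:.0f}% less run time")
# prints "4.2x as fast, 76% less run time"
```

Read this way, the analysis runs roughly four times as fast as before, which is the sense in which the headline's "400%" figure lines up with the quoted hours.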
IDSC provides hardware, software development, and analytics expertise to support a variety of research areas, including:
According to Dr. Nicholas Tsinoremas (IDSC Director), IDSC was founded on the premise that data drives discovery. Therefore, keeping pace with data growth is of paramount importance.
“Data-intensive discovery and multi-scale interdisciplinary approaches are becoming more prevalent in the way that sciences and engineering generate knowledge,” Nick explains. “The speed at which scientific disciplines advance depends in large part on how effectively researchers collaborate with one another and with experts in the areas of workflow management, data management, data mining, decision support, visualization, and cloud computing.”
Another guiding principle is the imperative to manage the entire data lifecycle as seamlessly as possible to streamline research workflow.
“We have integrated the Advanced Computing environment with our data capture and analytics environments, so movement is transparent between different research steps,” Nick adds. “This level of interactive processing speeds the delivery of data from sensors and instruments to the desktop of analysts and ultimately, into the hands of science-based decision makers.”
Unlike other advanced computing centers that originated as simulators, IDSC has always emphasized data as the driver of scientific results. Approximately 50% of IDSC’s users come from UM’s Miller School of Medicine, with ongoing projects at the John P. Hussman Institute for Human Genomics (such as research into Alzheimer’s disease) and The Miami Project to Cure Paralysis. The remaining 50% come from the Rosenstiel School of Marine and Atmospheric Science (RSMAS) as well as Engineering, Arts and Sciences, Architecture, Music, and Business.
Other notable projects requiring massive compute and storage power include cancer-biomarker research at the Sylvester Comprehensive Cancer Center and the University of Miami’s Grand Lagrangian Deployment project (sponsored by the Gulf of Mexico Research Initiative, GoMRI), the largest oceanic dispersion experiment of its kind, which explores surface flows near the site of the April 2010 Deepwater Horizon oil spill.
“Translating research requirements into actionable technology is no small feat,” says Joel Zysman, CCS Director, Advanced Computing. “Because of advances in computing, we now generate massive amounts of data from a multitude of models running multiple simulations simultaneously. All of this data must be stored, analyzed, and distributed to decision makers.”
For example, the Rosenstiel School regularly runs simulations of multiple climate models, which then are distributed to a variety of decision makers to determine the impact of climate change on water engineering, precipitation, and local water-management districts. In supporting research demands, IDSC generates upwards of 50TB each year for this project alone.
Additionally, the explosion of next-generation sequencing has had a major impact on compute and storage demands, as it’s now possible to produce more and larger datasets, which often create processing bottlenecks.
At IDSC, the heavy I/O [input/output] required to create four billion reads from one genome in a couple of days only intensifies when the data from the reads needs to be managed and analyzed. “The process of validating data and detecting variants in DNA sequences through SNP calling requires both high throughput and extremely high levels of interactivity,” adds Zysman. “Our goal was to reduce the time to perform the number crunching needed to map and merge hundreds of thousands of files so data could be analyzed faster.”
Aside from providing sufficient storage power to meet both high I/O and interactive processing demands, IDSC needed a powerful file system that was flexible enough to handle very large parallel jobs as well as smaller, shorter serial jobs. The key for IDSC was the ability to take advantage of very high I/O as well as very low IOPS [input/output operations per second, pronounced “eye ops”] without having to move data around, which would have required an inordinate amount of time and administrative overhead.
Additionally, IDSC had to address “data-in-flight” challenges resulting from major data surges during analysis. As the creation of intermediate files often resulted in a 10X spike in storage, it was critical to scale and support petabytes of machine-generated data without adding a layer of complexity or creating inefficiencies.
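To make the “data-in-flight” problem concrete, the following is a minimal capacity-planning sketch. It assumes the 10X spike applies to the size of the data being analyzed; the function names, the 5 TB and 20 TB job sizes, and the 900 TB of existing usage are hypothetical, while the 1,000 TB (one petabyte) capacity matches the appliance size cited in this story.

```python
def peak_storage_tb(input_tb: float, spike_factor: float = 10.0) -> float:
    """Estimate peak storage while intermediate files exist (assumed model)."""
    return input_tb * spike_factor


def fits(input_tb: float, capacity_tb: float,
         used_tb: float = 0.0, spike_factor: float = 10.0) -> bool:
    """Check whether a job's peak footprint fits in the remaining capacity."""
    return used_tb + peak_storage_tb(input_tb, spike_factor) <= capacity_tb


# Hypothetical jobs on a 1 PB (1,000 TB) system that is already 900 TB full.
print(peak_storage_tb(5))            # 50.0 TB at peak for a 5 TB input
print(fits(5, 1000, used_tb=900))    # True: 900 + 50 <= 1000
print(fits(20, 1000, used_tb=900))   # False: 900 + 200 > 1000
```

Under this assumed model, a system sized only for steady-state data would run out of space mid-analysis, which is why headroom for surges mattered as much as raw capacity.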
The ideal storage solution for IDSC would provide a single platform for both high-throughput genomics and highly interactive research collaboration. The Institute needed to accommodate its entire data lifecycle, so users didn’t have to deal directly with a lot of data movement. DDN Storage was superior to competing storage platforms with its ability to leverage one robust, easily managed platform for ensuring high performance, simplified collaboration, and accelerated data analytics.
“Where DDN really stood out is in the ability to adapt to whatever we would need,” says Zysman. “We have both IOPS-centric storage and the deep, slower I/O pool at full bandwidth. No one else could do that.”
DDN’s GS12K scale-out file storage appliance with one petabyte of storage was best suited for meeting the University’s growing IOPS and bandwidth requirements, while ensuring extremely fast application performance. Moreover, the embedded GPFS™ parallel file system eliminated the need to purchase and manage external servers, network adapters and switches. “DDN with GPFS was the best combination in terms of performance and integration,” Joel adds. “Instead of having disparate file systems and different queues, we were able to centralize everything and then scale accordingly while managing it all through a single pane of glass.”
Thanks to DDN’s massively scalable GS12K clusters, IDSC can meet its varied workflow demands, which are more file-set than file-system based.
“With DDN, the ingest pool is the same as our processing area, so we don’t need resources dedicated to pre-staging data,” explains Zysman. “Once data is entered, it can be migrated automatically to a lower tier of storage, which allows researchers to easily interact and collaborate because data never goes offline.”
DDN’s transparent data movement is also ideally suited for addressing IDSC’s back-end genomics workflows and interactive jobs, as the team can leverage one platform to capture, download, or move data. With DDN, the IDSC team has the confidence to process input from sequencing pipelines and to handle the data computations generated by its 15 Illumina HiSeq next-generation sequencing instruments. The team can also easily support all its application needs, including the use of BWA and Bowtie for initial mapping as well as SAMtools and GATK for variant analysis.
In addition to supporting IDSC’s internal sequencing requirements, DDN provides the necessary scalability and performance to support sequenced data from external databases, such as dbGaP and The Cancer Genome Atlas (TCGA), along with data generated by a dozen collaborations with academic and