Green Hackathon 2015
DNA For Collections
Many institutions in the Health and Life Sciences maintain collections of biological specimens for a variety of reasons. Among these are gene banks in plant and animal breeding, environmental samples for monitoring purposes, specimen collections for cataloguing biodiversity, and patient samples in biomedical research.
Collection material can be a source of DNA. It is usually not easy to get sufficient DNA out of a sample that was not collected for the explicit purpose of preserving that kind of molecule. Recent developments in DNA sequencing technologies, and particularly expected developments in so-called ‘single molecule sequencing’, enable access to precious samples collected over the course of hundreds of years. DNA is nature’s ultimate information-carrying molecule: it contains a wealth of information on the natural history of organisms and the molecular functions that define them. Moreover, because DNA information is so easily digitized, it can at the same time be used to make the collections more accessible. DNA information from every organism, from bacteria to whales, can be digitized (sequenced), stored, and searched in the same way. Due to the dropping costs of high-throughput DNA sequencing, more and more collection specimens are being selected for (re-)sequencing. This raises considerable methodological challenges, among which are:
- Handling degraded and fragmented DNA from old specimens
- The prevalence of non-model organisms and dearth of reference genomes
- Mixtures of DNA from different species in the data, either due to sampling method, post-mortem contamination, or infection by pathogens
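As an illustration of the last point, the sketch below flags reads that share most of their k-mers with a known contaminant sequence. It is a toy example under stated assumptions, not a method proposed for the hackathon: the function names, the k-mer size and the threshold are all hypothetical, and real screening tools use far more efficient indexes and error-tolerant matching.

```python
# Toy k-mer screen for contaminant reads (illustrative only).
# All names, the k-mer size and the threshold are assumptions for this sketch.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Return the set of all k-mers in seq (empty if seq is shorter than k)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(contaminant_refs, k):
    """Collect k-mers from both strands of every contaminant reference."""
    index = set()
    for ref in contaminant_refs:
        ref = ref.upper()
        index |= kmers(ref, k)
        index |= kmers(reverse_complement(ref), k)
    return index


def contaminant_fraction(read, index, k):
    """Fraction of the read's distinct k-mers that occur in the index."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        # Fragment too short to judge, as is common with degraded DNA.
        return 0.0
    return len(read_kmers & index) / len(read_kmers)


def partition_reads(reads, contaminant_refs, k=8, threshold=0.5):
    """Split reads into (kept, flagged) by their contaminant k-mer fraction."""
    index = build_index(contaminant_refs, k)
    kept, flagged = [], []
    for read in reads:
        if contaminant_fraction(read, index, k) >= threshold:
            flagged.append(read)
        else:
            kept.append(read)
    return kept, flagged
```

Reads shorter than `k` are kept rather than flagged here, which is a deliberate but debatable choice: very short fragments from old specimens carry too little signal to classify either way.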
In addition, the transformation of institutes that maintain such collections into accidental sequencing centers raises many practical challenges in terms of capacity for:
- Data storage, data retention and data publishing (something on ‘data models’, ontologies, distributed data warehouses, etc.? Stressing the uniformity of the DNA data as opposed to the heterogeneity of both the biology and the collections of different organisms?)
- HPC for de novo assembly and annotation
- State-of-the-art bioinformatics methods development
The prospect of generating and maintaining large datastores of thousands, perhaps in the future millions, of collection samples opens up novel ways of mining that data. This poses novel conceptual and practical challenges that, interestingly, resemble those of metagenomics.
- Collection material faces challenges similar to those encountered in metagenome studies. The material might be contaminated with fungal and bacterial DNA, some of it of much more recent origin than the sample itself. As with metagenome studies, providing context is important.
- Filling in the phylogenetic and, for species, temporal (some of these collections are old!) and geographic gaps, which is particularly important for species in regions where collecting is difficult because the original habitat has disappeared, the species has died out (locally or entirely) or legislation restricts sampling.
- De novo assembly based on degraded material will always be a challenge, but connecting to databases with reference genomes of the same or related species will allow reference-guided assemblies as an alternative.
To address these methodological and practical challenges from a technology perspective, we are organizing a 3-4 day hackathon, entitled #DNA4Collections, slated for April 2015 (exact date and location TBA).
- Rutger Vos
- Sven Warris
- Hendrik-Jan Megens
to be chosen
Venue and resources
Venue: Probably at or close to Naturalis, Leiden
- Mathijs Kattenberg (SURFsara) is willing to help with Grid and Hadoop resources.
Add yourself here
- Rutger Vos (Naturalis)
- Hendrik-Jan Megens (WUR)
- Sven Warris (WUR)
to be set up after selecting dates