3 - Data storage, preservation and curation are fundamental to the scientific process
This is one of thirteen recommendations for Data Stewardship formulated by the Netherlands E-Science Centre.
The E-Science Centre writes:
- Turning the challenge of dealing with the so-called data deluge into a series of opportunities for more rapid scientific discovery will require high standards of data management throughout research. This requires high compliance rates and suitable curation to ensure that deposited data retains value.
What DTL recommends for the Data Stewardship plan
Answer the following questions:
- Will your data be verified? Manually curated? Or via automated quality assurance programs?
- Can you guarantee a certain level of correctness?
- Do your numerical data incorporate standard uncertainties or variances? Are your qualitative data accompanied by a reliability estimate?
- Will you be able to process suggested corrections from external users of the data? During the project? After?
- How will you be able, at any moment, to point out which version of which software was run on which input file to produce a given result?
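The last question above — tracing which software version ran on which input to produce a result — can be answered by writing a small provenance record next to every output file. A minimal sketch (the function and field names are illustrative, not part of the DTL recommendations):

```python
import hashlib
import json
import sys
from datetime import datetime, timezone

def sha256sum(path):
    """Return the SHA-256 checksum of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_provenance(tool, version, input_path, output_path):
    """Write a JSON provenance record alongside the output file."""
    record = {
        "tool": tool,
        "version": version,
        "python": sys.version.split()[0],
        "input": input_path,
        "input_sha256": sha256sum(input_path),
        "output": output_path,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with open(output_path + ".prov.json", "w") as f:
        json.dump(record, f, indent=2)
    return record
```

Checksumming the input (rather than only naming it) also catches the case where a file was silently replaced between runs.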
Experience from DTL
Generic Technical issues
- F1 only back up raw and result data -> keep ‘intermediate’ data in separate folders
- F2 are files uncorrupted? -> keep checksums of key data files; always verify on transfer
- F3 minimize restore times -> dedicated server to restore from tape
- R1 what if main storage is offline? -> have two independent storage arrays + copy on grid
- R2 what if the compute cluster is offline? -> have two independent clusters + use of grid
- H1 what if files get accidentally deleted? -> versioned backup + regular test of the restore procedure
- F4 understand the files -> agreed upon SOP for naming scheme
- F5 find file/sample swaps -> have each sample measured using independent method
- M1 unclear what is the ‘true’ file (sample) list -> central file list, owned by data manager
- M2 unclear what has changed -> put file list in svn
- M3 how can we be sure that restore is not corrupt -> have checksum in file list
- M4 what exact analysis was run -> standalone scripts that include versions of tools/data used
- H2 what if an analysis was done incorrectly? -> run two independent analyses for important results
- H3 faulty changes to the software -> have second person review before use; open source
- H4 how to ensure analyses are identical -> use script templates that automate the whole analysis
- R3 what if pipelines are different between clusters or grid? -> make pipelines portable
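Several of the points above (F2, M1, M3) come together in a central file list that records a checksum per file; a restore can then be verified against it, and committing the list to svn (M2) gives a history of what changed. A minimal sketch, assuming a simple tab-separated list format (the format is illustrative):

```python
import hashlib
import os

def sha256sum(path):
    """Return the SHA-256 checksum of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_file_list(root, list_path):
    """Record '<checksum>\t<relative path>' for every file under root."""
    with open(list_path, "w") as out:
        for dirpath, _, names in os.walk(root):
            for name in sorted(names):
                full = os.path.join(dirpath, name)
                rel = os.path.relpath(full, root)
                out.write(f"{sha256sum(full)}\t{rel}\n")

def verify_file_list(root, list_path):
    """Return (path, problem) pairs for missing or corrupted files."""
    problems = []
    with open(list_path) as f:
        for line in f:
            checksum, rel = line.rstrip("\n").split("\t", 1)
            full = os.path.join(root, rel)
            if not os.path.exists(full):
                problems.append((rel, "missing"))
            elif sha256sum(full) != checksum:
                problems.append((rel, "checksum mismatch"))
    return problems
```

Keeping the list outside the data tree (and under version control, owned by the data manager) avoids checksumming the list itself and preserves M1's single ‘true’ file list.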
Specific per technology
- F8 relationships of data -> use persistent sample / lane identifiers through project, e.g. A1c
- M5 what is provenance of a data file -> in header of VCF keep trail of operations applied
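As an illustration of M5: variant callers such as bcftools already append a command line and version to the VCF header, so the header accumulates a trail of the operations applied. A shortened, hypothetical example (file names, dates and versions are made up):

```
##fileformat=VCFv4.2
##bcftools_mpileupVersion=1.9+htslib-1.9
##bcftools_mpileupCommand=mpileup -f ref.fa sampleA.bam
##bcftools_callVersion=1.9+htslib-1.9
##bcftools_callCommand=call -mv -Ov
```

Any in-house processing step can follow the same convention by adding its own `##`-prefixed header line with tool name, version and command line before writing the file back out.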