3- Data storage, preservation and curation are fundamental to the scientific process

This is one of thirteen recommendations for Data Stewardship as formulated by the Netherlands E-Science Centre.

The E-Science Centre writes:

Turning the challenge of dealing with the so-called data deluge into a series of opportunities for more rapid scientific discovery will require high standards of data management throughout research. This requires high compliance rates and suitable curation to ensure that deposited data retain their value.

What DTL recommends for the Data Stewardship plan

Answer the following questions:

  • Will your data be verified? If so, by manual curation or by automated quality-assurance programs?
  • Can you guarantee a certain level of correctness?
  • Do your numerical data incorporate standard uncertainties or variance? Are your qualitative data accompanied by a reliability estimate?
  • Will you be able to process suggested corrections from external users of the data? During the project? After?
  • How will you be able, at any moment, to point out which version of which software was run on which input file to produce a certain result?
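
One way to be able to answer the last question is to write a small provenance record next to every result. The following is a minimal sketch, not a prescribed DTL format: the function name `make_run_record`, the field names, and the example tool name and version are all assumptions chosen for illustration.

```python
import hashlib
import json
import sys
from datetime import datetime, timezone


def make_run_record(tool, tool_version, input_name, input_bytes, output_name):
    """Describe one analysis run: which tool version ran on which input.

    All field names are hypothetical; adapt them to your own plan.
    """
    return {
        "tool": tool,
        "tool_version": tool_version,
        "python_version": sys.version.split()[0],
        "input": {
            "name": input_name,
            # The checksum identifies the exact input content, not just its name.
            "sha256": hashlib.sha256(input_bytes).hexdigest(),
        },
        "output": output_name,
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
    }


# Example values are made up for illustration.
record = make_run_record(
    tool="example-aligner",
    tool_version="1.2.3",
    input_name="sample_A1c.fastq",
    input_bytes=b"@read1\nACGT\n+\nIIII\n",
    output_name="sample_A1c.bam",
)
print(json.dumps(record, indent=2))
```

Storing such a record as a JSON file beside each result keeps the answer available after the project has ended.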

Experience from DTL

Generic Technical issues

  • F1 only back up raw and result data -> keep ‘intermediate’ data in separate folders
  • F2 are files not corrupted? -> keep checksums of key data files; always verify them on transfer
  • F3 minimize restore times -> dedicated server to restore from tape
  • R1 what if main storage is offline? -> have two independent storage arrays + copy on grid
  • R2 what if the compute cluster is offline? -> have two independent clusters + use of grid
  • H1 what if files get accidentally deleted -> versioned backup + test of the restore...
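
The checksum advice in F2 can be sketched as follows. This is an illustrative example using only the Python standard library; the helper names `file_sha256` and `copy_verified` are assumptions, and SHA-256 is one reasonable choice of checksum, not one the source prescribes.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_verified(src, dst):
    """Copy src to dst and fail loudly if the copy differs from the source."""
    expected = file_sha256(src)
    shutil.copyfile(src, dst)
    actual = file_sha256(dst)
    if actual != expected:
        raise IOError(f"checksum mismatch after transfer: {dst}")
    return actual


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    source = Path(workdir) / "raw_data.bin"
    target = Path(workdir) / "backup_copy.bin"
    source.write_bytes(b"example measurement data")
    print(copy_verified(source, target))
```

Reading in chunks keeps memory use constant, which matters for large raw data files.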

Generic Provenance

  • F4 understand the files -> agreed upon SOP for naming scheme
  • F5 find file/sample swaps -> have each sample measured using independent method
  • M1 unclear what is the ‘true’ file (sample) list -> central file list, owned by data manager
  • M2 unclear what has changed -> put file list in svn
  • M3 how can we be sure that restore is not corrupt -> have checksum in file list
  • M4 what exact analysis was run -> standalone scripts that include versions of tools/data used
  • H2 what if analysis was done incorrectly -> run two independent analyses for important results
  • H3 faulty changes to the software -> have second person review before use; open source
  • H4 how to ensure analyses are the same -> use script templates that automate the whole analysis
  • R3 what if pipelines are different between clusters or grid? -> make pipelines portable
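
Items M1 to M3 above (a central file list, kept under version control, with a checksum per file) can be sketched as a plain-text list that diffs cleanly in svn. The tab-separated layout and the function names `build_file_list` and `check_against_file_list` are assumptions made for this example, not an established DTL format.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path):
    """SHA-256 of a whole file (fine for a sketch; chunk large files)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def build_file_list(root):
    """Return one 'checksum<TAB>relative/path' line per file, sorted by path."""
    root = Path(root)
    files = sorted(p for p in root.rglob("*") if p.is_file())
    lines = [f"{file_sha256(p)}\t{p.relative_to(root).as_posix()}" for p in files]
    return "\n".join(lines) + "\n"


def check_against_file_list(file_list_text, root):
    """Compare a directory (for example a restore) with the file list."""
    root = Path(root)
    problems = []
    for line in file_list_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        expected, relative = line.split("\t", 1)
        target = root / relative
        if not target.is_file():
            problems.append(f"missing: {relative}")
        elif file_sha256(target) != expected:
            problems.append(f"corrupt: {relative}")
    return problems


# Demonstration: build the list, then simulate a damaged restore.
with tempfile.TemporaryDirectory() as workdir:
    data_dir = Path(workdir)
    (data_dir / "sample_A1c.txt").write_text("raw reads")
    (data_dir / "sample_B2d.txt").write_text("more raw reads")
    file_list = build_file_list(data_dir)
    (data_dir / "sample_B2d.txt").write_text("damaged")
    print(check_against_file_list(file_list, data_dir))
```

Because the list is sorted and has one file per line, a version-control diff shows exactly which files were added, removed, or changed.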

Sector specific

Specific per technology

Genomics

  • F8 relationships of data -> use persistent sample / lane identifiers through project, e.g. A1c
  • M5 what is provenance of a data file -> in header of VCF keep trail of operations applied
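
For M5, VCF meta-information lines start with `##` and precede the `#CHROM` column header, so a trail of operations can be kept by adding one such line per processing step. The sketch below assumes that approach; the function name `add_provenance_line` and the key `pipelineStep` are hypothetical and not defined by the VCF specification.

```python
def add_provenance_line(vcf_text, key, value):
    """Insert '##key=value' directly before the #CHROM column header.

    Repeated calls keep the steps in the order they were applied.
    """
    lines = vcf_text.splitlines()
    for index, line in enumerate(lines):
        if line.startswith("#CHROM"):
            lines.insert(index, f"##{key}={value}")
            break
    else:
        raise ValueError("no #CHROM header line found")
    return "\n".join(lines) + "\n"


# Minimal made-up VCF content for illustration.
vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\t.\tA\tG\t50\tPASS\t.\n"
)
vcf = add_provenance_line(vcf, "pipelineStep", "quality-filter v0.1 (hypothetical tool)")
vcf = add_provenance_line(vcf, "pipelineStep", "annotate v0.2 (hypothetical tool)")
print(vcf)
```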