Focus meeting "From reproducibility to reusability"
Many of the minimal metadata standards now in use are directed more towards reproducibility than towards reusability of data. For example, reusability requires much more detail about the composition of a cohort than a statement that it is "similar in composition" to a reference cohort.
In this meeting we will explore our experience with this. How can we make sure when we make data FAIR that the R really means reusability for other purposes? What kind of extensions of standards need to be made? What existing metadata standards can be used?
In practice, it is really hard to find suitable data sets. Once you have found them, they are often not really reusable, because the data description is aimed at reproducibility. Minimal metadata standards are biased in this direction. How can we avoid this? Can we make reasonable metadata standards? How can electronic lab notebooks be used? How can we make sure researchers report not only that they had a look at the age distribution of the population, but actually report the ages?
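The difference between reporting that the age distribution was checked and reporting the ages themselves can be made concrete with a small sketch. The records, field names and values below are all invented for illustration; they do not come from any study discussed here.

```python
from statistics import mean

# Hypothetical individual-level records for a two-arm diet study.
# All field names and values are invented for illustration.
PARTICIPANTS = [
    {"id": "P1", "diet": "A", "age": 30, "weight_change_kg": -2.0},
    {"id": "P2", "diet": "A", "age": 60, "weight_change_kg": -1.0},
    {"id": "P3", "diet": "B", "age": 40, "weight_change_kg": -3.0},
    {"id": "P4", "diet": "B", "age": 50, "weight_change_kg": 0.0},
]


def group_summary(records, diet):
    """Group-level description: enough to state that the arms are comparable."""
    ages = [r["age"] for r in records if r["diet"] == diet]
    return {"diet": diet, "n": len(ages), "mean_age": mean(ages)}


def mean_change_by_age_band(records, threshold):
    """A reuse question that can only be answered from per-individual ages."""
    bands = {"younger": [], "older": []}
    for r in records:
        band = "younger" if r["age"] < threshold else "older"
        bands[band].append(r["weight_change_kg"])
    return {band: mean(values) for band, values in bands.items() if values}


if __name__ == "__main__":
    for diet in ("A", "B"):
        print(group_summary(PARTICIPANTS, diet))
    print(mean_change_by_age_band(PARTICIPANTS, threshold=45))
```

In this toy data both arms have the same mean age, so "the groups did not differ in age" is a correct statement for reproducing the main result. The age-banded comparison, however, cannot be derived from the two group summaries; it needs the individual records.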
Rob Hooft. Chris Evelo will introduce the meeting
Subgoals/Lecture and discussion topics
- Define the problem and find "Low hanging fruits" solutions
- Set up a group that will draft a statement on "How can we use the EOSC to solve this?"
Study capturing, phenotype description harmonisation.
EGA links to BioSamples according to Jordi.
Have minimal phenotype standards (group vs individual), get them into BioSamples/BioStudies, capture them in dbNP and MOLGENIS.
Use in EGA (TraIT/EGA project), ArrayExpress, PRIDE, MetaboLights (this is basically the ISA specification).
The benefit of more formal semantics, not only using a vocabulary.
Could this become an ELIXIR implementation study?
We need to make a story for foundational ontologies that also convinces people who have become skeptical after earlier failed attempts.
From a message Chris sent to Susanna Sansone:
- A lot of people start to agree on main principles like FAIR or approaches like the one behind ISA. And that is a good thing.
- We have a lot of initiatives around that too. You are probably aware of a lot more than I am.
- A lot of the already captured data, especially in the genomics field, tries to follow these principles and approaches, even if we still try to improve that technically (by using linked data approaches, for instance).
- Nevertheless, when I work with systems biologists who really try to reuse data, they stumble on a number of issues. Most of these are well known and mainly make things less computer-readable. Some of these we could tackle in collaboration between different initiatives. For instance, it would be useful to have a software framework for recording descriptions with ontology terms that also records the ontology used, can do an automatic ontology search and add the same info for terms not used, and integrates ontology mapping services. Several groups are fighting these problems, and collaboration could also help to make things more interoperable from the start.
- What worries me are the problems people run into if they do the work manually and really want to find and combine data. What typically happens in studies is that there is study data, plus data describing the group averages (and differences therein) for things that were not studied. To give you a typical example:
- People do a diet study with two diets (described in detail) and look at outcomes like weight change and have omics data.
- They report what they typically put in the paper too: groups were not different in other things (gender distribution, average age, blood pressure, you name it).
- Now others want to reuse that data to look at age or gender effects, or something like that. That information isn’t there. In fact, for reproducibility of the main outcomes it isn’t even needed. So it looks like we need different descriptive information for reproduction and for reuse.
- I am not sure about the best solution. Awareness definitely helps, which leads to “let’s discuss and come up with a position paper or something like that”. But for different fields we could also come up with a suggestion for what to capture on a sample or individual level whenever possible.
- Of course, one of the problems is that asking people to add more information will also mean more work and may make it less likely that they actually do it. That is another reason to describe it well. It is also a reason to think about collaboration between capture tools (ISA-creator, dbNP, Molgenis, and various electronic notebook initiatives) and how these can capture interoperable information to start with. Then think about where the study descriptions should go (BioSamples, BioStudies, you may have more ideas) and again how this then propagates to the actual data repositories and how all that is linked. Basically, at least reduce the effort to entering the information only once. Yes, a lot of that sounds like BII, which I still think is a brilliant idea, although we might make it even more of a hub, as you had in mind.
- We have discussed this a bit between involved groups in the Netherlands (among others Morris and Jildau for ENPADASI). We plan to hold a Dutch DTL Focus meeting around that. Rob will organise it. I think it would be good if you or somebody from your group could join that too.
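The framework idea in the message above (record each description together with the ontology it came from) could look roughly like the sketch below. This is only an illustration under stated assumptions: the lookup table stands in for a real ontology search service, and the ontology name `EXAMPLE`, its version string and the `EX:` identifiers are hypothetical placeholders, not real ontology terms.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for an ontology search service.
# The identifiers below are placeholders, not real ontology terms.
HYPOTHETICAL_TERMS = {
    "body weight": "EX:0001",
    "age": "EX:0002",
}


@dataclass(frozen=True)
class OntologyAnnotation:
    """A free-text label plus the term and ontology it was mapped to, if any."""
    label: str
    term_id: Optional[str]
    ontology: Optional[str]
    ontology_version: Optional[str]


def annotate(label, terms=HYPOTHETICAL_TERMS, ontology="EXAMPLE", version="v1"):
    """Look up a label and record which ontology (and version) supplied the term."""
    term_id = terms.get(label.strip().lower())
    if term_id is None:
        # Keep the free text, but make the missing mapping explicit.
        return OntologyAnnotation(label, None, None, None)
    return OntologyAnnotation(label, term_id, ontology, version)


if __name__ == "__main__":
    for text in ("Body weight", "shoe size"):
        print(annotate(text))
```

Storing the ontology name and version next to each term identifier is what lets a later mapping service translate the annotation, while unmapped labels remain visible as candidates for curation.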
Somewhere in October
Somewhere in Utrecht
- Morris will ask Floris Imhann, a medical doctor in Groningen with a lot of data expertise in gastroenterology. (Floris is back in the clinic; if he can make it, he can speak.)
- Chris will ask Lydia Afman - between data production and data reuse (nutrigenomics) (Lydia is interested)
- Eliana Papoutsoglou - Using more extensive metadata [can speak from November onwards; before that her research is not ready for a talk]
- Chris Evelo - Metadata capturing techniques
- Klemen Zupancic - Asked, we will need funding to fly him in.
- Jildau - Herself
- Chris Evelo
- Morris Swertz
- Luiz Bonino
- Lars Eijssen
- Jildau Bouwman
- Richard Finkers
- Eliana Papoutsoglou
- Jaap Heringa
- Sanne Abeln
- ISA? Susanna S, Peter McQ: allowed to come, or take part in discussion remotely.
- Johanna McEntyre
- Kees van Bochove
- Raise interest at the 19/1-20/1 ENPADASI meeting
- Peter Doorn (DANS): He preaches that the "R" in FAIR is a consequence of "FAI"...
- Someone from Susanna Sansone's group
- 13:00 Coffee
- 13:15 Welcome and Purpose
- 13:20 Agenda and Ground Rules
- 13:25 First talk
- 13:45 Questions [Quick/special questions answered, difficult/general ones on flipover]
- 13:50 Second talk
- 14:10 Questions [Quick/special questions answered, difficult/general ones on flipover]
- 14:15 Tea break
- 14:35 Third talk: Eliana Papoutsoglou: how extra metadata helps
- 14:55 Questions [Quick/special questions answered, difficult/general ones on flipover]
- 15:00 Fourth talk: Chris Evelo: tools we need to capture the metadata along the way
- 15:20 Questions [Quick/special questions answered, difficult/general ones on flipover]
- 15:25 Tea break [Meeting chair organizes the flipover questions]
- 15:45 Discussion [Based on the organized flipover questions]
- 16:30 Evaluation
- 16:35 Parking lot
- 16:40 Breakup
- 17:00 Out!