Interview with Robin Lenz from the Leibniz Institute of Polymer Research Dresden, the 2024 award winner.
"The greatest benefit of well-managed data is probably our own. Making our data openly accessible afterwards is then hardly any additional effort."
What was your submitted data set about?
I submitted a data set and viewer for the spectroscopic characterisation of polymers naturally weathered in the marine environment. They are part of the JPI Oceans: microplastiX project, which is investigating the effects of microplastics in marine ecosystems.
Were there any difficulties with the FAIR storage of the data?
None that were unexpected. I had already dealt with this topic beforehand. For me it is the best way to work; in new projects we will definitely plan FAIR right from the start.
In the past I worked on a project that didn’t start out FAIR and didn’t have to, but in the end it was exhausting. We ended up with a lot of confusing versions of Excel files and Word documents, and I realised that it would be difficult to reach a good outcome that way.
In the project that has now been submitted, data collections from several institutes were used. You can’t just throw it all together: first, everything has to be harmonised. And if I have to do that for myself anyway to get an overview of the data, then I might as well do it in such a way that it is comprehensible for everyone.
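To illustrate the kind of harmonisation step described here, a minimal Python/pandas sketch: the file names, column mappings and shared schema below are purely hypothetical placeholders, not the project’s actual data structure.

```python
# Minimal harmonisation sketch (hypothetical file names and column mappings):
# each institute delivers CSV files with its own column names; map them onto
# one common schema before combining anything.
import pandas as pd

# Per-institute mapping from local column names to a shared schema (assumed).
COLUMN_MAPS = {
    "institute_a.csv": {"Wavenumber (cm-1)": "wavenumber_cm1", "Abs": "absorbance"},
    "institute_b.csv": {"wn": "wavenumber_cm1", "absorbance_au": "absorbance"},
}

frames = []
for path, mapping in COLUMN_MAPS.items():
    df = pd.read_csv(path)
    df = df.rename(columns=mapping)[["wavenumber_cm1", "absorbance"]]
    df["source"] = path  # keep provenance so every row stays traceable
    frames.append(df)

# One harmonised table that downstream scripts and a viewer can rely on.
harmonised = pd.concat(frames, ignore_index=True)
harmonised.to_csv("harmonised_spectra.csv", index=False)
```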
Why did you choose GitHub as your repository?
I work a lot with Python to make visualisations, format data and much more. On the one hand I have a lot of CSV files containing the data, and on the other hand a lot of Python scripts that do something with these CSV files. This is where Git, as a tool from software development, comes in. I also wanted to put the data on Zenodo at the end, which is well integrated with GitHub.
However, due to the volume, hosting code and data together in a repository is no longer my preferred option.
What would be your preferred variant?
I would like to keep this separate in future. In this project, I later deposited the purely chemical data, i.e. spectra with their descriptions and metadata, on RADAR4Chem, because the chemical community can find them there independently of the code and perhaps reuse them.
It would make more sense to track it like this right from the start: the data in a chemical repository and the code in a GitHub repository, which can then also be versioned via Zenodo so that you can refer to a specific version.
Moreover, if the project were significantly larger, with more CSV files or measurement data filling gigabytes, there would be no other way: data at that scale can no longer be sensibly stored in a code repository.
I’m currently puzzling over how to combine this so that when you execute the code, it queries the chemical repository and retrieves the data from there instead of having it stored as a copy alongside the code. I also spoke to the people at RADAR4Chem about this. Such a query seems a little more technically complex, but should be possible in principle. Then there would be a unique location for each type of data, i.e. code and chemical data, which would be the cleanest solution.
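As a rough illustration of what such run-time retrieval could look like, here is a minimal Python sketch that downloads a file from a published Zenodo record instead of keeping a copy in the code repository. The record ID and file name are hypothetical placeholders, and a query against RADAR4Chem would need that service’s own interface.

```python
# Minimal sketch: fetch a data file from a repository at run time instead of
# storing a copy alongside the code. The record ID and file name below are
# hypothetical placeholders.
import io

import pandas as pd
import requests

RECORD_ID = "1234567"       # hypothetical Zenodo record
FILE_NAME = "spectra.csv"   # hypothetical file within that record

# Published Zenodo records expose a direct download URL per file.
url = f"https://zenodo.org/records/{RECORD_ID}/files/{FILE_NAME}?download=1"
response = requests.get(url, timeout=60)
response.raise_for_status()

# Load the CSV straight into memory; nothing is kept in the code repository.
spectra = pd.read_csv(io.BytesIO(response.content))
print(spectra.head())
```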
Do you use data management plans (DMP) in your team?
In the project from which this publication emerged, there was one, yes. We generally use an electronic lab notebook (ELN); here at the institute it is LabArchives. And we have a centralised raw-data drive and definitions of what is stored where. In this respect, there is a technical infrastructure and there are group-internal routines that could be written into a DMP.
Whether or not a DMP document is drawn up is then decided on a project-specific basis, depending on the requirements.
Why did you apply for the FAIR4Chem award?
The information about the call came from a Data Scout here at the institute. We weren’t actually ready yet; the data was still under embargo. But then I asked, got approval and submitted it. I then lost sight of it again and was quite surprised when I received the e-mail that we had actually won.
Do you have a recommendation for other researchers who are not yet storing FAIR as to why and how they should start?
It makes sense to work according to the FAIR data principles! You can do it for various reasons: firstly, out of an altruistic attitude. As many people as possible should be able to use what has been produced, often financed by public funds.
But I can also look at it from a “lazy” or selfish perspective and ask: “Who will benefit most from having this data available in an open and clearly structured way?”
Most likely it will be me in the future, because I will probably have forgotten what I was thinking today and where I put things; computers will be retired, USB sticks will no longer be readable, and so on.
The greatest benefit of well-managed data is therefore probably our own. Making our data openly accessible afterwards is then hardly any additional effort, or at least no major change. From this perspective, too, it is good and sensible to work according to the FAIR principles.
As I said, I came into this project when it was already running. I was faced with a pile of data that I had to familiarise myself with, and I had to write scripts that read it in a standardised way – so that everything was on the same level and I could understand it at all. Then I thought to myself: “I’ve already written this anyway, so I can put it online, including the app and visualisation, and everyone can understand it.” That was practically no extra work on top of what I had to do for myself anyway.