Data Management: The Plan

By max, Mon 16 November 2015, in category Hacking

Perhaps because it's not something I would have done on my own, thanks to the prodding of Daniel Mietchen I have created a data management plan for my open-PhD adventure. What is a data management plan (DMP), you might ask? Now that I'm up to speed, I can tell you that it's a document in which you set out the parameters for how you will create, share, and store the outcomes of a project. It's also the sort of thing you go through in order to pose detailed questions to yourself and make rigorous your otherwise slightly sloppy thinking.

The question of which license to use for the data I produce was quite easy for me: I'm dedicated to using as open a license as possible while still requiring attribution. Other topics, however, gave me pause in considering how I will handle the spew of data exhaust I produce all the time. For instance, documentation: how will I keep track of what all those bits represent?

There is a large amount of hubris to sidestep with documentation. Thinking that I could maintain a grand folder structure or a complete list seems like myopic optimism. I think there is some wisdom to draw from the casual observation that even as organisational tools online improve, most coordination still gets done by plaintext email. That's why I have no plan to keep a megalist sort of card-catalogue of documentation, but instead to include it as files alongside the data. I will aim for IPython/Jupyterized inline documentation of data handling when possible, falling back to standard readme.md files in the directory otherwise. That is, I'll be relying on search as an organisational principle, so the key will be making the barrier to searchable documentation as low as possible - like writing quick files.
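To make that concrete, here's a minimal sketch of what "search as an organisational principle" might look like, assuming the layout above where each data directory carries its own readme.md (the script name and command-line interface are just illustrative, not part of the DMP itself):

```python
#!/usr/bin/env python
"""Sketch: find data directories whose readme.md mentions a keyword."""
import sys
from pathlib import Path


def search_readmes(root, keyword):
    """Yield (directory, matching line) for every readme.md under root."""
    for readme in Path(root).rglob("readme.md"):
        for line in readme.read_text(errors="replace").splitlines():
            if keyword.lower() in line.lower():
                yield readme.parent, line.strip()


if __name__ == "__main__":
    # e.g. python search_readmes.py ~/phd-data interviews
    root, keyword = sys.argv[1], sys.argv[2]
    for directory, line in search_readmes(root, keyword):
        print("{}: {}".format(directory, line))
```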

On the question of archiving: how will I keep the data around for a long time? This was a difficult question for me because I wasn't sure exactly how much longevity I want from my data. 2 years, 5 years, 20 years? Starting with needs and constraints rather than desires, I reasoned that storing my data should be easy and free (as in beer). That's why I'm opting for GitHub in the DMP. But there are two worries with GitHub: one is that it limits individual files to 100 MB, so perhaps it's not suitable for all possible data. The second concern is that GitHub is a company, and like any other swashbuckler with venture-capital-driven bravado it could disappear easily. So then I thought that I might also rely on some HTTP-accessible servers at my university: no file-size limits, corporate independence, and tape backup storage. I am rather happy with that combination, but if I wanted to invest a lot more effort for not a lot more benefit, I could nitpick at both of them being centrally managed regardless of profit motive. The only way to get around that would be to do what? Create torrents of my data and seed them from personal servers. The idealist in me is tempted, but the time-scheduler sees the headache.
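Since GitHub's per-file limit is the sharp constraint here, a quick check like the following (a sketch of my own, not any official tooling; the 100 MB figure is GitHub's documented per-file limit) could flag troublesome files before a push:

```python
#!/usr/bin/env python
"""Sketch: flag files that exceed GitHub's 100 MB per-file limit."""
import os
import sys

LIMIT = 100 * 1024 * 1024  # GitHub rejects individual files over 100 MB


def oversized(root):
    """Yield (path, size in bytes) for files larger than LIMIT."""
    for dirpath, dirnames, filenames in os.walk(root):
        if ".git" in dirnames:
            dirnames.remove(".git")  # skip the repository's own metadata
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size > LIMIT:
                yield path, size


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for path, size in oversized(root):
        print("{}: {:.1f} MB".format(path, size / (1024.0 * 1024.0)))
```

Anything a check like that turns up would simply have to live on the university servers instead.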

If you haven't had to make a data management plan, I can understand that it seems rather abstruse and time-consuming. However, you should do it anyway because a) it forces you to think more closely about your data on an abstract level, and b) http://dmptool.org/ makes it 50% less scary than it needs to be. Oh, and in the interest of openness after all, my DMP is here: Open-PhD data management plan.pdf.