
Ted Hart
ecologist / data scientist / developer

I'm a senior data scientist in Silicon Valley and adjunct faculty at the University of Vermont. I build things for data: things that process it, parse it, visualize it, and analyze it. I like my beer cold, my snow deep, my mountains high, and my data open. I am a recovering academic.


Just get over yourself and share your data

I'm probably a bit late to the game on the whole #PLoSfail controversy over data sharing and archiving. I think one of the best things to come out of PLoS pushing the policy is that it's opened up a larger discussion about a variety of topics. Should we share our data? Who owns data? Is data the same as software? My favorite post on the subject so far has to be Matt MacManes', which in short says "Get the fuck over yourself scientists, you're not that special now share your fucking data." (Contrary to Hope Jahren, I love to say fuck.)

Should people share their data?

In short, yes. Data + inference and analysis is what makes up science. The former is just a bunch of pieces of paper or bits without the latter, and the latter is just speculation without the former. Also, if we accept that reproducibility is part of science, it can't be done without the data. I know there is a further argument that true reproducibility is achieved only by doing an experiment from start to finish: recollecting the data and redoing the analysis. But I think sharing the data behind a paper is a good practical step towards reproducibility.

Who owns the data?

I'm sure there's a legal answer to this, so I'll just agree with Matt MacManes that in spirit you don't own your data. You didn't pay for it, and I doubt you actually collected it all yourself (even grad students have undergrad minions). Here in Boulder, the city doesn't recognize pet ownership; since no living thing can own another living thing, I am simply my dog's "guardian". I think of my data the same way. I collected it and I care for it, but it's not mine; it's just a pleasant companion.

Is software the same as data?

As someone who develops software, I'm going to say an unequivocal yes. Just like you can't do your science without data, you can't do it without software these days. If you feel like anyone who uses your data should make you a coauthor, imagine if Brian Ripley demanded to be a coauthor on every paper that used R. It sounds ludicrous (for a variety of reasons). Nonetheless, both are prerequisites for creating the end product of a publication. I just don't see how someone could be stingy with their data but happy to gobble up as much code as they can to implement some fancy method that someone else developed and shared (something you're not doing with your data).

People have offered up a variety of arguments that I just can't swallow. Perhaps the most inane is the notion that archiving data is hard: there are no standards, different labs have different practices, and how will we ever figure out which one to use? Admittedly, this is a really hard problem; it's my job 40 hours a week here at NEON. But it is hardly an intractable one. Is the alternative that everyone just does everything in secret, with myriad idiosyncrasies, ferociously milking least publishable units from a data set? That just seems like a recipe for science moving slowly and in the dark.
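
To make the "hardly intractable" point concrete, here's a minimal sketch of one low-bar practice: pairing a tidy table with a small machine-readable metadata file. The file names, columns, and metadata fields below are made-up illustrations, not NEON's workflow or any particular community standard.

```python
# Minimal sketch: a tidy data table plus a metadata sidecar.
# All file names, columns, and metadata fields are hypothetical examples.
import json

import pandas as pd

# One row per observation, no merged cells, no color-coded meaning.
obs = pd.DataFrame({
    "site": ["A", "A", "B", "B"],
    "date": ["2013-06-01", "2013-07-01", "2013-06-01", "2013-07-01"],
    "abundance": [12, 19, 7, 11],
})
obs.to_csv("bird_counts.csv", index=False)

# Enough metadata that a stranger (or future you) can reuse the table.
metadata = {
    "title": "Example bird counts",
    "creator": "Your Name Here",
    "license": "CC0",
    "columns": {
        "site": "site identifier",
        "date": "survey date, ISO 8601",
        "abundance": "number of individuals counted per survey",
    },
}
with open("bird_counts_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```

That's nowhere near a full metadata standard, but it's a long way from an opaque spreadsheet, it takes minutes, and it's already enough for someone else to make sense of the file.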

No, I think we just need to own up to the fact that being a scientist these days requires new skills, and it always has. You didn't have to know how to do PCR prior to 1983, but now you do. In the 1990s, how many ecologists could do a mixed-effects model? Now I see them all the time. In the 21st century, to do science better we need more than spreadsheets with a few rows; we need to implement best practices for data management. So I say suck it up, share the data behind the paper, manage your data well, and let's all get on with our lives.