Ted Hart

I’m a senior data scientist in Silicon Valley and adjunct faculty at the University of Vermont. I build things for data: things that process it, parse it, visualize it, and analyze it. I like my beer cold, my snow deep, my mountains high, and my data open. I am a recovering academic.

ecologist / data scientist / developer

How should we cite data?

At first, this question comes across as almost rhetorical and silly. After all, there are plenty of guidelines for citing data, so why is this even a question? People often draw an analogy between citing a paper and citing a dataset, but a paper is not always analogous to a dataset. From the perspective of an individual researcher generating data, the analogy often holds. In many ways the paper is just a logical extension of the dataset, so the way you cite one is almost interchangeable with the way you cite the other. But when is a dataset not like a paper? That's the challenge we face at NEON in developing our citation policy. As a data provider we don't create discrete entities, but instead provide over 500 continuous data products.

Why cite?

The motivation for data citation is somewhat similar for a large data-generating organization like NEON and for an individual PI. Like a PI, we need to get credit for our efforts if we want continued funding. We need to show the NSF that we provide value to the community and that it's worth paying my salary every week, just as PIs need credit for their data to get grants, land jobs, advance in their careers, and so on. So that's one motivation: to receive credit. Citation also gives us an internal metric of which of the data we collect people are actually using. Other data-generating organizations like NCAR and MODIS mostly use citation as a way to measure the impact of their data and to track which data products are used.

At a bare minimum, this kind of citation allows usage to be tracked, but are there other reasons to cite? At the other end of the "reasons to cite" spectrum is reproducibility. If the data set is cited, authors have identified the source of all their inferences, allowing other researchers to (theoretically) reproduce a result. These two reasons for citation, tracking and reproducibility, form the ends of an axis of trade-offs in citation policy.

Implications

So why does the data-generating organization's goal for citation matter? Fundamentally, a citation is just a pointer to a thing. That's why we use DOIs in citations: they provide a persistent identifier. In the case of a paper or a researcher-generated dataset, the granularity is easily defined. It's the paper, or it's the dataset behind the paper. At NEON, though, we'll collect thousands of measurements at hundreds of locations over 30 years. We'll be able to assign DOIs at whatever spatiotemporal granularity we like, from individual measurements taken every second all the way up to the entire organization itself. If the granularity of identification follows from why we want people to cite NEON, then we as an organization need to decide what our goals are.

Furthermore, where an individual's or organization's motivations fall on the tracking/reproducibility axis has implications for how onerous citation is for the end user. If we just want to track the fact that NEON data was used, it's easy to say: "Cite NEON as an organization." It's a simple static citation that we can provide to every user. If we want to emphasize reproducibility, we have to mint DOIs for very specific data streams, and users of our data have to be very specific about what they used in their publication or product. That requires tracking on their part, which makes citation generation more user-vicious than just providing static text. All this gets to the fact that citing data isn't as easy as it first seems, and is intimately tied to an organization's goals.

Where we're headed

As the person in charge of developing citation policy for NEON, it should be no secret where I want to push us as an organization. Yes, simple tracking citations like the ones MODIS uses will give us some useful data, but frankly I think we can do better. At NEON we have a chance to push the boundaries of reproducibility. To that end, we're developing a workflow where NEON data users will be able to build custom spatiotemporal data sets from multiple data products. We can keep track of these custom data sets in a user's account, generating an internal query code for each. Users will be able to return to these saved searches, modify them to closely match the data used in a paper, and generate a DOI. Essentially, we are trying to mimic what would happen if a researcher created their own data set. Then, to reproduce science done with NEON data, people can just resolve the DOI and get the exact same data set. However, this requires that users download data, do their science, and then return to NEON to generate a citation.
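To make the saved-search idea a little more concrete, here is a rough Python sketch of how a saved query could be turned into a stable internal query code and a citation string. Everything in it is my own illustration and not NEON's actual system: the `SavedQuery` fields, the hashing scheme, and the `10.0000/example` DOI prefix are all hypothetical placeholders, and a real DOI would have to be minted through a registration agency rather than just formatted as text.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SavedQuery:
    """A user's saved spatiotemporal selection (all fields are hypothetical)."""
    products: tuple  # data product identifiers
    sites: tuple     # site identifiers
    start: str       # ISO 8601 date, inclusive
    end: str         # ISO 8601 date, inclusive

    def query_code(self) -> str:
        """Derive a stable internal code from the contents of the query."""
        # Sort the collections so selection order does not change the code.
        canonical = {
            "products": sorted(self.products),
            "sites": sorted(self.sites),
            "start": self.start,
            "end": self.end,
        }
        payload = json.dumps(canonical, sort_keys=True).encode("utf-8")
        # Truncated SHA-256 digest: an assumed scheme, chosen only for brevity.
        return hashlib.sha256(payload).hexdigest()[:12]


def citation(query: SavedQuery, doi_prefix: str = "10.0000/example") -> str:
    """Format a citation string. The DOI prefix is a placeholder, not a real one."""
    doi = f"{doi_prefix}.{query.query_code()}"
    return (
        f"NEON. Custom data set ({len(query.products)} data products, "
        f"{len(query.sites)} sites, {query.start} to {query.end}). doi:{doi}"
    )


if __name__ == "__main__":
    # Made-up product and site names, purely for illustration.
    q = SavedQuery(
        products=("product-a", "product-b"),
        sites=("site-1",),
        start="2015-01-01",
        end="2015-12-31",
    )
    print(citation(q))
```

The point of deriving the code from the query's contents is that two users who select the same products, sites, and dates end up with the same identifier, which is the property you want if resolving it is supposed to return the exact same data set.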

At the same time, we realize this is user-vicious, and people may want to cite us without having to keep track of multiple data streams, or may want a more informal citation method for scholarly works that aren't peer reviewed. For that case we are working on a citation that is just for the observatory as a whole. The concern, though, is that if we provide a simple static method, most people will forgo the more onerous but highly reproducible citation. So we're left wondering: should we provide only one citation method to really force the reproducibility angle? Or should we provide more than one citation method to accommodate multiple use cases? We'd love some feedback from the community to help guide this process while it is still being developed. Any comments would be greatly appreciated.