Software citations, Gypsy Science, and Beer

ESIP Communication

May 01, 2014

Community Fellows

At the end of March, the NYTimes Magazine had a great feature about two brothers who are “gypsy brewers”: they dream up beers, but outsource all of the actual brewing.

Mikkel, the brother most famous in the USA, put it this way:

“I don’t enjoy making beer. I like making recipes and hanging out.”

Say what you will about the work ethic of the Danes, but I think most people who received an advanced degree in a field other than computer science would say something similar about software:

“I don’t like writing code. I like using software and finding things out.”

And in at least some fields, especially those that depend on community infrastructures, this process should sound eerily familiar:

A person has an idea (a recipe), some notions of how to execute it (ingredients / processes), and a sense of what the result should look like (taste). The person with the idea isn’t terribly interested in doing all of the labor to produce a result, but they are very interested in analyzing the results (i.e., drinking the beer).

This year I’m doing a fellowship at NCAR, and I see this process play out in the following ways:

Someone working on a problem that requires the use of an earth-system model will design a study, and then write some pseudocode that captures the basics of what they would like a model to simulate. They hand that work off to a software engineer, who then refactors and optimizes the code (or the model parameters) to run on an HPC system.

The simulation is run …

Eventually, the results of the simulation are staged on a storage device and made available to the scientist for his or her analysis.

Being naïve and new to this field, I keep asking these two questions:

1. How does the software engineer get recognized for this work? The same question could be asked about the people that actually brew Mikkel’s beer. Is a simple “brewed at…” sufficient attribution for the work that went into producing a significant result?

2. Is this ‘gypsy science’?

I think the answers to both questions are related, but they require quite a bit of historical context.

Gypsy Science in the 18th Century

A divided and invisible labor force has been used in laboratory work since at least the 18th century, when chemist and natural philosopher Robert Boyle used to design experiments from his bedside table and hand them off to lab assistants to actually execute.

In writing up their results, Boyle would often express his thanks for being able to complete “experiments by others’ hands” (1772, vol. 2, p. 14).

As Steve Shapin, a historian of science, tells it, Boyle was by no means unique: many of his contemporaries did very little of the labor, and even less of the design and innovation with technical aspects of the laboratory work that gained them so much fame (1989).

Was Boyle’s prodigious output, and the Scientific Revolution of his contemporaries, predicated on ‘Gypsy Science’?

I don’t know.

I do know that two and a half centuries later the labor and skills that make the results of many important science projects possible remain invisible, and that a formal reward system that privileges first authors and their publications / citations is completely outdated.

What’s changed?

What is exceptionally different about contemporary work in software and data-intensive science, compared with the physical labor done by the technicians of Boyle’s laboratory, is what we might call digital trace data.

Where historians like Shapin have had to hunt down any esoteric mention of “others’ hands” to understand the role of technicians in 18th century laboratories, people studying software labor in contemporary settings (i.e. me) see traces of this work all over: in software version control systems, in the log data of digital repositories, and in project management tools like JIRA.

Trace data on “who does what?” in contemporary science is abundant.
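To make the idea of trace data concrete, here is a minimal sketch of the simplest possible "who does what?" tally: commits per author in a git repository. It is only an illustration, not a proposed metric. The function names and the sample data are hypothetical, and it assumes `git` is installed if you point it at a real repository.

```python
import subprocess
from collections import Counter


def tally_authors(log_text: str) -> Counter:
    """Count commits per author from text with one author name per line."""
    names = (line.strip() for line in log_text.splitlines())
    return Counter(name for name in names if name)


def authors_from_repo(repo_path: str) -> Counter:
    """Run `git log --format=%an` in repo_path and tally the author names."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--format=%an"],
        capture_output=True, text=True, check=True,
    )
    return tally_authors(result.stdout)


# Made-up sample standing in for real `git log --format=%an` output.
sample_log = "Ada\nGrace\nAda\n\nLinus\nAda\n"
commit_counts = tally_authors(sample_log)
print(commit_counts.most_common())
```

Commit counts are a crude proxy for labor (one commit can be a typo fix or a month of work), but they show how little effort it takes to surface contributors who never appear in an author list.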

So, it seems bizarre to me that oftentimes the proposed remedy to problems of attribution in software development is based on an 18th century workflow: citations.

Not by citation alone…

From a reproducibility standpoint – Yes! We need code to be a first class object that can be cited for the sake of reproducibility.

We also need ways to archive and assign persistent identifiers to a piece of software used to produce a research finding. That is undeniably important, and luckily some very smart people are doing some very good work on this issue.

But, from an attribution or acknowledgement standpoint, relying on software citations seems like a terribly inefficient way to distribute credit.

– Citations take a very long time to accrue;

– Citations are difficult to aggregate and normalize;

– Authorship is often unevenly distributed and oddly recognized by existing citation-based reward systems; and

– Citation data are indexed by for-profit companies that then turn around and sell them back to people like me to analyze.

If not citations, what?

One suggestion has been to focus on the types of trace data that I mentioned earlier in order to create something like a programmer’s h-index (Capiluppi et al. 2012).
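For readers who haven’t computed one, here is a rough sketch of the standard h-index calculation applied to trace data. This is the generic definition, not necessarily the one Capiluppi et al. use, and the per-project commit counts are invented for illustration.

```python
def h_index(counts):
    """Return the largest h such that at least h items have a count >= h."""
    ranked = sorted(counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` items all have at least `rank` each
        else:
            break
    return h


# Hypothetical commits per project for one developer.
commits_per_project = [25, 8, 5, 3, 3, 1]
developer_h = h_index(commits_per_project)
print(developer_h)
```

Under this definition, a developer with exactly three commits in each of three projects gets the same score as the hypothetical developer above, which hints at how much a single number can hide.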

I tend to agree with the Impact Story view that an h-index, in whatever form, gives a limited view of an individual’s impact. To really move away from reward systems that rely on one-dimensional metrics, we need the kind of innovative thinking that is happening around altmetrics at start-ups like Impact Story, or in the OSS report card project, which aggregates and reanalyzes GitHub metrics to give you an OSS contribution grade.

Software Metrics + ESIP

There are many directions for this work to go, and this will undoubtedly be a trial and error process. But, I think the ESIP community is in a unique position to reflect on what makes developing Earth and Space Science software unique – and how we might best take advantage of existing trace data to create metrics that value this community’s contribution to contemporary science work (in all of its shapes and forms).

I don’t have great suggestions on where to start (yet!) but I think it’s an exciting topic to work on and think about. And, so does NSF.

If you care about these issues, and want to help us work on them in the ESIP software cluster there are two ways you can join us:

1. Our monthly telecon is on the second Wednesday of each month. Our next meeting will be on Wednesday, May 14th at 3pm EST. Notes from our past meetings and general information about the cluster are on the ESIP commons (http://wiki.esipfed.org/index.php/Science_Software)

2. Matt Mayernik and I have put together a session proposal for the ESIP summer meeting dealing with this topic (http://commons.esipfed.org/node/2330) – please come talk about this, or a related issue.

Or, if you just want to talk about beer, Boyle, or anything else I’ve mentioned here, drop me a line at nmweber@illinois.edu or find me on twitter @nniiicc.

Works Cited

Boyle, R. (1772). Complete Works, ed. T. Birch, 6 vols.

Capiluppi, A., Serebrenik, A., & Youssef, A. (2012). Developing an h-index for OSS developers. In Mining Software Repositories (MSR), 2012 9th IEEE Working Conference on (pp. 251-254). IEEE.

Shapin, S. (1989). The invisible technician. American Scientist, 77(6), 554-563.
