Q&A with Falkenberg Awardee Ryan Abernathey
Credit: August Politis via Unsplash
Ryan Abernathey is the 2021 Charles S. Falkenberg Award recipient, a joint award through the American Geophysical Union (AGU) and Earth Science Information Partners (ESIP). In his Q&A, Abernathey takes on the big ideas and small details of team science.
The Falkenberg Award winner is always invited to speak at the summer ESIP Meeting. As the 2021 Falkenberg Awardee, Abernathey will speak at the Friday plenary of the July 2022 ESIP Meeting, helping shake up the norms of knowledge sharing and data exchange.
Across the broad open science movement, much of the work comes down to a single word: collaboration. The question is how to collaborate.
Open science implementation often focuses on changing cultures and values, which is key to including more groups in research leadership. For many of the same reasons, Abernathey sees the need to make sure open science’s technical issues get addressed as well.
In his Q&A, Abernathey (RA) dives into how he and his team line up the gears of their research enterprise.
What I do: Rework scientific collaborations and build new infrastructure and tools to improve how we share data and knowledge about Earth systems.
Why I do it: Our group wants to work more collaboratively but the technical systems we have for working on the same analysis code base, sharing big datasets, and putting together reproducible workflows are still full of pain points.
Falkenberg Award
The Charles S. Falkenberg Award is an annual award sponsored by AGU and ESIP to recognize an early to mid-career scientist who has contributed to the quality of life, economic opportunities and stewardship of the planet through the use of Earth science information and to the public awareness of the importance of understanding our planet.
Q: Let’s talk about team science. How do you define it?
RA: The norm in academic research is that each scientist (PhD student, postdoc, etc.) has their project, they work on it mostly in isolation, and maybe meet with a mentor once a week for some feedback.
In our group, people observed that this is a pretty lonely way of working. Regardless of whether it’s efficient or not, people in our group want to be interacting with others most of the time. So can we make science more collaborative on a day-to-day basis, just within our own research group? If so, and if we can define practices, tools and approaches that work on this scale, maybe we can scale them up more broadly across the whole scientific enterprise.
Q: What is an example of this co-creation format in action?
RA: So we are trying this now with five scientists in our group: a research scientist, a postdoc, a software engineer, a data engineer and an undergraduate intern.
We are trying to all work together on the same paper. This research topic involves air-sea interaction: We want to better understand how small-scale structures in the ocean and atmosphere (eddies, fronts, etc.) impact the large-scale exchange of heat, moisture and momentum between ocean and atmosphere. We are doing this by studying super-high-resolution climate models which directly resolve these interactions. This is a data-intensive project, which is kind of the theme for our research group.
One big challenge has been to think about how to effectively divide up the work, to turn it from a sequential list of steps that are executed by one individual to a collection of independent tasks that are done in parallel by the team. So, we have the data engineer and postdoc working on identifying relevant datasets and ingesting them into our cloud data lake, while the research scientist and software engineer work on coding up the algorithms and packaging them for reuse by the team (and the rest of the world). The undergrad, meanwhile, is doing a deep dive on the theory and mathematics of air-sea interaction.
These parallel threads are starting to converge — and it’s really exciting. But it was definitely slow going to spec out the project structure and roadmap.
Q: What are challenges that team science helps with? What are challenges within team science itself?
RA: There are social challenges with team science – splitting up the work, deciding who does what, assigning credit appropriately – but at least in our group, people are eager and willing to confront those. I am much more fixated right now on the technical challenges.
The status quo for how science is done is that each scientist has their own computer, their own code and their own data on their hard drive. Maybe they use a shared server or cluster, but these systems are explicitly structured in a way to isolate each user from the others. This style of infrastructure makes collaboration effectively impossible. The burden of sharing an entire project is just too heavy.
This is why lots of us in the open science community are very excited about the cloud. Basically everyone is now used to collaborating on documents via Google Docs. The idea of emailing around Word documents is accepted as old-fashioned and inefficient. Working on a shared document in the cloud feels natural, and everyone sees the benefits. To me the key question is: How can we reach the same level of fluid collaboration with the actual science? That is, the data, the code, the workflows, the visualizations, everything that goes into making an actual scientific discovery.
We can see a lot of the elements we need for that type of collaboration. GitHub is obviously a huge part of this, and we use it extremely heavily in our group. It’s a platform for sharing code, talking about code and coordinating code-related projects. But science is more than just code. We need a place to actually run the code. And we need data to operate on.
In Pangeo, we have really pushed forward the idea of “Jupyter in the cloud” as a platform for research. Combined with cloud data storage, this paradigm allows scientists to share code, environments and data in a way that makes it pretty easy to collaborate on data-intensive projects. But there are still tons of pain points. Sharing notebooks is not as easy as it could be. Cloud object storage is hard to work with. No one knows how to actually pay for cloud computing. So there is a lot of work still to be done to realize the potential of the cloud for team science.
Q: In our Earth science data community, we care a lot about giving credit where credit is due. How can we encourage collaboration, open sharing, and proper attribution?
RA: The status quo is that hiring and promotion committees care about only one kind of scientific output: peer-reviewed papers in high-profile journals. Thanks to efforts by the FAIR-data movement and the research software engineering community, this is starting to change. We are now encouraged by journals and funding agencies to publish and cite datasets and software used in our research. However, I still don’t see those citations being tracked or incorporated into hiring decisions. I recently read an interesting (and discouraging!) study presented at IDCC 2022 that revealed that data citations are really not being properly tracked or indexed by our scholarly infrastructure. So despite all the emphasis on open data, we are not going the last mile and actually tracking the impact of data. The situation is even worse for software citation.
In my opinion, we need to completely rethink the citation system and start conceptualizing research as a knowledge graph. The nodes in the graph are the different types of entities in the research enterprise: data, code, models, papers, people, funders, infrastructure, etc. The edges are the connections between them: What data was used for this paper? Who wrote the code underlying that model? People in the “meta-science” community have been talking about ontologies and knowledge graphs for decades. Shout out to all the librarians! But we need to invest more in the technical infrastructure for tracking the global knowledge graph if we want to change the culture of collaboration in science.
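Editor's illustration: the knowledge-graph idea above can be sketched in a few lines of Python. This is only a toy under stated assumptions, not a description of any existing scholarly infrastructure. The class names (Node, KnowledgeGraph), the relation labels (uses_data, built_on, wrote) and the example entities are all hypothetical, chosen to mirror the two questions posed in the answer: what data was used for this paper, and who wrote the code underlying that model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One entity in the research enterprise (hypothetical schema)."""
    kind: str  # e.g. "paper", "dataset", "code", "model", "person"
    name: str


class KnowledgeGraph:
    """A tiny directed graph with labeled edges, stored in memory."""

    def __init__(self):
        # Maps (source node, relation label) -> set of target nodes.
        self._edges = defaultdict(set)

    def add_edge(self, source, relation, target):
        self._edges[(source, relation)].add(target)

    def targets(self, source, relation):
        """Nodes reached from `source` by following `relation`."""
        return set(self._edges.get((source, relation), set()))

    def sources(self, relation, target):
        """Nodes that point at `target` through `relation`."""
        return {
            src
            for (src, rel), tgts in self._edges.items()
            if rel == relation and target in tgts
        }


def data_used_by(graph, paper):
    """What data was used for this paper?"""
    return graph.targets(paper, "uses_data")


def authors_of_code_under(graph, model):
    """Who wrote the code underlying that model?"""
    return {
        person
        for code in graph.targets(model, "built_on")
        for person in graph.sources("wrote", code)
    }


# Illustrative entities only; none of these refer to real records.
paper = Node("paper", "air-sea interaction paper")
dataset = Node("dataset", "high-resolution climate model output")
code = Node("code", "flux analysis package")
model = Node("model", "example climate model")
person = Node("person", "software engineer")

graph = KnowledgeGraph()
graph.add_edge(paper, "uses_data", dataset)
graph.add_edge(model, "built_on", code)
graph.add_edge(person, "wrote", code)

print(data_used_by(graph, paper))
print(authors_of_code_under(graph, model))
```

Storing edges under a (source, relation) key keeps the forward lookup cheap, while the reverse lookup scans every edge. A real system tracking a global graph would need persistent identifiers and indexed storage, which this sketch does not attempt.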