Discovery Cluster: Linking Datasets to the Applications that Use Them

In this post, ESIP Community Fellow Sara Lafia describes the goals and activities of the ESIP Discovery Cluster.

The ESIP Discovery Cluster, led by Chris Lynnes (NASA), is building a Usage-Based Discovery tool to help researchers discover relationships between earth science datasets and the applications that use them. The “usage-based” paradigm exposes the diverse ways that earth science datasets have been used, allowing others to assess whether those data could meet their needs. During the 2021 ESIP Summer Meeting on Thursday, July 22nd, dozens of participants from the ESIP community convened online for a “data foraging” workshop to contribute information about earth science datasets and help improve the Discovery Cluster’s tool. In just an hour, participants uncovered over 100 relationships between research publications and the earth science datasets they use, and added them to the Usage-Based Discovery tool (Figure 1).

Figure 1. Fire-related applications (left) and the datasets (right) that they use, as exposed in the Usage-Based Discovery tool.

Teams formed around four core topics (fire, flood, sea level, and life sciences), plus a “Motley Crew”. The topics organized earth science applications, datasets, publications, and their use cases. Participants had success searching for datasets using a combination of persistent identifiers and keywords. Expertise from different disciplines and organizations enriched the search process; for example, foragers who were more familiar with specific instruments or products, like MODIS, searched specifically for applications that used those data. They also suggested domain-specific keywords, such as “coastal erosion” for sea-level datasets and applications, which revealed new use cases and relationships. Search platforms like Google Dataset Search and Google Scholar were also helpful, as they are beginning to expose links between some datasets and publications using Digital Object Identifiers (DOIs); participants were able to review and validate these links and add them to the tool’s usage database along with topic keywords.
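
To make the idea of a validated link concrete, here is a minimal Python sketch of how a publication-dataset relationship with topic keywords might be recorded. It is not the Usage-Based Discovery tool's actual data model: the `UsageLink` class, the `add_link` helper, and the DOIs in the usage note are all hypothetical. The one design choice worth noting is that DOIs are normalized before storage, so the same link found through two search platforms is recorded only once.

```python
from dataclasses import dataclass, field

# Common ways a DOI is written in the wild (assumed list, not exhaustive).
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Strip a leading resolver/scheme prefix and lowercase the DOI.

    DOIs are case-insensitive, so lowercasing gives a stable comparison key.
    """
    doi = raw.strip()
    lowered = doi.lower()
    for prefix in DOI_PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.lower()


@dataclass(frozen=True)
class UsageLink:
    """Hypothetical record: one publication that uses one dataset."""
    publication_doi: str
    dataset_doi: str
    topic: str
    keywords: frozenset = field(default_factory=frozenset)


def add_link(links: dict, publication: str, dataset: str, topic: str, keywords=()):
    """Add a link, merging keywords if the same DOI pair is already present."""
    key = (normalize_doi(publication), normalize_doi(dataset))
    merged = frozenset(k.strip().lower() for k in keywords)
    existing = links.get(key)
    if existing is not None:
        topic = existing.topic            # keep the topic recorded first
        merged = merged | existing.keywords
    links[key] = UsageLink(key[0], key[1], topic, merged)
    return links[key]
```

For example, calling `add_link(links, "https://doi.org/10.1234/ABC", "doi:10.5678/data.1", "sea level", ["coastal erosion"])` and then `add_link(links, "10.1234/abc", "10.5678/DATA.1", "sea level", ["tide gauge"])` leaves a single entry in `links` carrying both keywords (the DOIs here are made up for illustration).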

Demonstrating data impact

Showing how research datasets are used in applications and publications is related to a broader effort to measure the impact of research data. The Discovery Cluster’s ESIP Community Fellow, Sara Lafia (University of Michigan, ICPSR), is developing a natural language processing approach to assist in the detection of informal data references. The team is exploring a combination of cues, such as the presence of indicator terms, to predict whether articles mention research datasets. This effort supports the continued development of the ICPSR Bibliography of Data-related Literature, a resource demonstrating the many ways that ICPSR’s research studies are cited. Findings from this ongoing research could inform the expansion of the ESIP Discovery Cluster’s Usage-Based Discovery tool through machine learning, as well as other related efforts.
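
As a rough illustration of what an “indicator term” cue could look like, the Python sketch below counts occurrences of a few data-related phrases in a passage and flags the passage when enough of them appear. This is an assumption-laden toy, not the approach under development: the term list, the `min_hits` threshold, and the function names are all invented for this example, and a real system would combine many cues rather than rely on keyword counts alone.

```python
import re

# Hypothetical indicator terms; chosen for illustration only.
INDICATOR_TERMS = (
    "dataset",
    "data set",
    "data from",
    "survey",
    "retrieved from",
    "archive",
)


def indicator_hits(text: str, terms=INDICATOR_TERMS) -> dict:
    """Return {term: count} for each indicator term found in the text.

    Matching is case-insensitive, respects word boundaries, and allows
    an optional trailing "s" so simple plurals are counted too.
    """
    lowered = text.lower()
    hits = {}
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"s?\b"
        count = len(re.findall(pattern, lowered))
        if count:
            hits[term] = count
    return hits


def likely_mentions_data(text: str, min_hits: int = 2) -> bool:
    """Flag a passage when the total number of indicator hits reaches min_hits."""
    return sum(indicator_hits(text).values()) >= min_hits
```

A sentence such as “We used data from the MODIS burned area dataset, retrieved from the NASA archive.” would be flagged, while a sentence with no data-related phrasing would not. In practice a cue like this would more plausibly serve as one input feature to a trained classifier than as a decision rule on its own.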

Get involved!

Many high-quality article-dataset relationships were added to the tool during the foraging workshop. The ESIP Discovery Cluster plans to continue adding more information about earth science datasets, applications, and publications that use data. The tool supports login via ORCID to give credit to contributors who add relationships. There may also be potential to draw on sources like Scholix, Crossref, or DataCite to ingest more relationships. These could provide valuable training data for future machine learning applications that discover and link earth science datasets to the research publications and applications that use them. The ESIP Discovery Cluster is now classifying research articles into topics. If you would like to contribute to our cluster’s efforts to classify research articles, please visit (!) to read more about this work, get involved, and maybe even have fun while contributing!

More about Sara: Sara is a Postdoctoral Research Fellow in the Inter-university Consortium for Political and Social Research (ICPSR) at the University of Michigan. She holds a Ph.D. in Geography from UC Santa Barbara and was the recipient of ESIP's Raskin Scholarship in 2018. Her current research focuses on analyzing curation activities, detecting data citations, and developing metrics to track the impact of data reuse. She also enjoys developing geospatial applications, engineering linked data models, and designing data visualizations.