Q&A: Weaving ML in Earth Science
Earth Science Information Partners (ESIP) has a special request for proposals to support machine-learning (ML) tutorials for Earth science. ESIP participants Ziheng (Jensen) Sun from George Mason University and Nicoleta Cristea from the eScience Institute at the University of Washington share their new collaboration.
Machine Learning (ML) in Earth Science
Swollen streams and sediment plumes reveal extensive flooding along Australia's coast.
The new ESIP Lab RFP supports machine-learning tutorials to better understand such
events and will focus on projects in hydrology, seismology and the cryosphere.
Credit: NASA, OLI 2
Sun and Cristea have different disciplinary backgrounds. But they discovered common ground through analyzing Earth observation data. They are now collaborating on a new National Science Foundation (NSF) cybertraining grant to build the GeoScience MAchine Learning Resources and Training (GeoSMART) framework.
In their Q&A, Sun (ZS) and Cristea (NC) explain how the ESIP Lab connected them and opened up opportunities to address gaps in ML education for geoscientists.
Q: How did the ESIP Lab help you in your work?
ZS: The ESIP Lab jumpstarted my career as an independent PI. I had almost zero project management experience, only ideas and fantasies about Earth AI, and the ESIP Lab gave me my very first chance to try my idea and shine. I had six valuable months to implement the first version of Geoweaver, and I presented it at the ESIP Winter Meeting 2019. I met Nicoleta the same year at ESIP and we have worked together ever since. We went on to win NASA and NSF awards to develop Geoweaver into a reliable tool for Earth AI practitioners.
NC: The ESIP Lab is our partner on two NSF-funded projects. The first one originated from the ESIP Lab snow mapping project that later became larger, NSF-funded work. The second one is the development of GeoSMART. Our focus is first on hydrology, seismology and cryospheric science. Through ESIP we will interact with GeoSMART collaborators and other ML enthusiasts.
ESIP Lab Support for Tutorials: ML in Earth Science
This special ESIP Lab RFP seeks machine-learning (ML) tutorials related to hydrology, seismology, and the cryosphere. The seed funding is $5,000, or $7,000 if including Geoweaver. Funded tutorials will be included in the GeoSMART curriculum, an NSF-funded initiative to educate the next generation of researchers to use and adopt powerful machine learning tools.
Learn more: esipfed.org/rfp
Q: A domain scientist and an ML expert. What has your collaboration been like?
NC: I study snow cover, and when I started my research that meant working with coarse spatial resolution datasets. Now, high spatiotemporal resolution data from newer remote sensing platforms are available, and I need to analyze larger volumes of gridded datasets with methods that differ from the traditional techniques. The speed of ML and the time it saves made it look like a good option. But that’s not my background.
ZS: ML is my wheelhouse, though. I come from a GIS and computer science background and wanted to use it to study Earth processes. In 2018, I received an ESIP Lab grant to try out an idea, and it helped me create Geoweaver, a workflow management system designed to help Earth scientists automatically preserve their ML workflows, source code, and history.
NC: And that’s how we met through ESIP. We were both leading ESIP Lab incubator projects. My project involved mapping high resolution snow cover areas with ML. Ziheng was developing Geoweaver.
ZS: Now we’re collaborating on the GeoSMART project to help other geoscientists and data scientists leverage Earth science data and ML technology to explore ideas which were impossible before.
Machine Learning for Geoscience: GeoSMART
The GeoScience MAchine Learning Resources and Training (GeoSMART) framework will provide an educational pathway that begins with the foundations of open source scientific ecosystems and progresses through general ML theory, toolkits and deployment on cloud computing. The ultimate goal is to offer and sustain cyber training opportunities.
The project is funded by the National Science Foundation (NSF OAC 2113874).
Q: Let’s talk about ML in Earth science. What are new applications?
ZS: ML has actually been used in Earth system sciences for decades, though it’s seen some recent advances. Most ML applications never hit the market or were never used by the wider community because of their spatiotemporal limitations and uncertainty, their lower stability, and limited understanding of their internal mechanisms. Recently, the huge success of data-intensive ML applications in industry has revived the Earth science community’s interest in artificial intelligence (AI) technology. We are actively trying out all kinds of new models, algorithms, libraries and hardware to fit them to Earth science datasets. Researchers face many specific challenges related to the complexity and space-time characteristics of Earth datasets, and we are actively finding ways to address them. ML in Earth science is very much ongoing work that is still in its early stages.
NC: ML models are increasingly found in applications ranging from pattern identification, time series analysis and prediction (e.g., floods) to land cover classification from remote sensing data and the creation of derived high spatiotemporal resolution products. ML can also be used in combination with physics-based models.
Machine Learning (ML) in Snow, Water, and Earthquakes
This astronaut photograph of the Salish Sea offers a high-level glimpse of Earth's hydrologic, seismic and cryospheric processes. The new ESIP Lab RFP looks to support machine-learning tutorials in these Earth science fields.
Credit: NASA
Q: What is the main challenge for ML in Earth science data?
NC: A lot of time is spent getting data into an ML-ready format, including labeling unstructured data. For more computationally intensive applications, cloud literacy may also be needed. Benchmark datasets and pretrained models are also limited.
ZS: Yes, data preparation is probably the most time-consuming step (60-70%) in the entire ML pipeline. It also leads to another non-trivial challenge: full-stack AI workflow sharing. When people share their ML source code, training data preparation is usually overlooked. The result is a disastrous amount of duplicated effort across universities and companies, with massive numbers of hours wasted on tackling the same AI data preparation problems. We are witnessing this in every corner of Earth science departments. Orchestrating and sharing the full-stack AI workflow, instead of just some ML notebooks on toy training datasets, is the key to addressing this challenge.
We published a review paper in Computers and Geosciences that covers more challenges and opportunities.
Q: What is Geoweaver?
ZS: Geoweaver is an open source workflow management system for data scientists, mostly Python and shell scripting users. It can record the history of every model run, with both the source code and the output logs saved in a local database (two portable files under your home directory). When you export and share the workflow (comprising shell scripts, Python code and notebooks), all the history is shared along with it. The people who receive Geoweaver workflow packages can easily see all the struggles you have been through. With Geoweaver, the people who see your work will understand and reuse your data science experiments much more easily and quickly than if you just handed them the bare-bones source code or notebooks.
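To make the idea of run history concrete, here is a minimal sketch of saving each run's source code and output log in a local database. It is not Geoweaver's API or storage format: the `run_history` table and the `open_history`, `run_and_record` and `show_history` functions are hypothetical names, and SQLite from the Python standard library is assumed purely for illustration.

```python
import hashlib
import sqlite3
import subprocess
import sys
import time


def open_history(db_path=":memory:"):
    """Open (or create) a local history database. The schema is hypothetical."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS run_history (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               started_at REAL,
               code_sha256 TEXT,
               source_code TEXT,
               output_log TEXT,
               exit_code INTEGER
           )"""
    )
    return conn


def run_and_record(conn, source_code):
    """Run a Python snippet in a subprocess and store its code and log together."""
    started = time.time()
    proc = subprocess.run(
        [sys.executable, "-c", source_code],
        capture_output=True,
        text=True,
    )
    # Keep stdout and stderr so failed runs ("the struggles") are preserved too.
    log = proc.stdout + proc.stderr
    digest = hashlib.sha256(source_code.encode("utf-8")).hexdigest()
    cur = conn.execute(
        "INSERT INTO run_history "
        "(started_at, code_sha256, source_code, output_log, exit_code) "
        "VALUES (?, ?, ?, ?, ?)",
        (started, digest, source_code, log, proc.returncode),
    )
    conn.commit()
    return cur.lastrowid


def show_history(conn):
    """Return (id, exit_code, output_log) for every recorded run, oldest first."""
    return conn.execute(
        "SELECT id, exit_code, output_log FROM run_history ORDER BY id"
    ).fetchall()
```

Passing a file path instead of `":memory:"` to `open_history` would keep the history in a single portable file that can be shared alongside the code.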
Q: What’s your favorite part of Geoweaver? What makes a good ML workflow?
NC: With Geoweaver, ML workflows are tractable and reproducible. My favorite parts are the visualization of workflow elements and the data and code management.
ZS: Using Geoweaver is a long-term investment. My favorite analogy is going to the dentist and being told that there is plaque under the gum line. You can choose to deep clean it now and keep up your flossing routine, or you may lose the tooth in ten years. In Earth AI use cases, the ML workflow is that tooth, the plaque is all kinds of reproducibility issues in that workflow (incompatibility, underlying library upgrades, API changes, data availability), and Geoweaver is the dentist. There will always be something that makes your workflow fail in the future. A very common example: Python upgrades away from your old version. You can choose to use Geoweaver to orchestrate the entire workflow and record all the historical code and logs, or you will lose all of them in one year (or sooner; I, for example, will forget all of them in three months).
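One way to see this kind of drift coming is to save a snapshot of the Python and library versions with each run and compare it later. The sketch below is a generic illustration using only the standard library; it does not describe how Geoweaver itself works, and `snapshot_environment` and `diff_environment` are hypothetical helper names.

```python
import platform
from importlib import metadata


def snapshot_environment(packages):
    """Record the Python version and the installed version of each named package."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment.
            versions[name] = None
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


def diff_environment(old, new):
    """Return {name: (old_version, new_version)} for everything that changed."""
    changes = {}
    if old["python"] != new["python"]:
        changes["python"] = (old["python"], new["python"])
    for name, old_version in old["packages"].items():
        new_version = new["packages"].get(name)
        if old_version != new_version:
            changes[name] = (old_version, new_version)
    return changes
```

Calling `snapshot_environment(["numpy", "scikit-learn"])` at run time and storing the result next to the logs would let a later `diff_environment` call show which upgrade is the likely cause when an old workflow stops working.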
Some example Geoweaver workflows:
Q: Earth science data is inherently interdisciplinary. Any advice for Earth scientists diving into data or data scientists looking at ML in Earth science processes?
NC: There is a shortage of professionals with the powerful combination of Earth and data science skills that can accelerate discovery in geosciences. I highly encourage data scientists to try to experiment more with Earth science data, and Earth scientists to learn data science workflows and toolkits. More specifically, there is a dire need for Earth science ML specialists.
ZS: Earth scientists should determine their strategy based on their technical background. Right now, most AI experiments rely on Python programming, which might be challenging for people new to the field. For them, I recommend solutions with a lower technical barrier, such as AutoML and low-code software. Compared to ML programming, low-code software might be less flexible, but it saves a tremendous amount of time spent learning the technology and provides a stable environment for benchmarking.
Geoweaver streamlines ML in Earth science
Initially funded with $10,000, Geoweaver went on to become a 2018 National Geospatial-Intelligence Agency/InnoCentive Challenge Winner and received $900,000 last year through the NASA Advancing Collaborative Connections for Earth System Science (ACCESS) Program. ESIP Lab is currently offering an additional $2,000 of funding for projects using Geoweaver to compose their full-stack ML workflow.
It is extremely rare to find individuals with all the necessary skills to fully understand and apply machine learning to a science question.
The ESIP Lab doesn't seek the mythical do-it-all scientists. Rather, we foster collaboration.
As PIs funded through the ESIP Lab, Nicoleta and Ziheng, both highly competent in their own domain areas, have formed a collaboration that has enabled them to push the bounds of their own productivity and scientific curiosity.
Q: How will GeoSMART roll out?
ZS: The GeoSMART project started in September 2021 and will last at least three years. We are preparing for the first mini training week and gathering all the ML materials for curriculum development.
NC: A lack of classroom Earth science ML curricula and of publicly available examples that make ML methods easy to adopt motivated us to develop GeoSMART. This year we are establishing the project website infrastructure and some content, and calling on the community to contribute, as in the ESIP Lab RFP. Next, with the support of NSF and the University of Washington eScience Institute, we will follow with two immersive hands-on training events (hackweeks) over the next two years. We will add content to the website continuously throughout the project. Our collaborators are UW scientists Marine Denolle, Anthony Arendt and Scott Henderson.
Monitoring snow cover with CubeSats and machine learning
In 2019, Cristea led an ESIP Lab-funded team from the eScience Institute at the University of Washington to hone a workflow for an ML algorithm implementation to derive snow-covered areas from Planet Labs imagery.
Earth Science Information Partners (ESIP) is a 501(c)(3) nonprofit supported by NASA, NOAA, USGS and 130+ member organizations.
Nicoleta Cristea, PhD, Research Hydrologist in the eScience Institute at the University of Washington.
Ziheng (Jensen) Sun, PhD, Research Assistant Professor in the College of Science at George Mason University.