Mapping the Language of Machines

Megan Carter

Oct 04, 2019

Community Fellows

ESIP Community Fellows Zachary Robbins and Yuhan Rao reflect on the GeoSemantics Symposium held at the 2019 ESIP Summer Meeting.

With the rapidly growing volume of Earth and environmental data and increasing computational capacity, Machine Learning (ML) and other artificial intelligence technologies have become increasingly popular in Earth and environmental sciences. As ML is strongly data-driven, this rise highlights the natural connection between ML research and the data semantics community.

On Monday, July 15th, 2019, the ESIP Semantic Technologies Committee and Machine Learning Cluster co-hosted the 3rd ESIP GeoSemantics Symposium, co-located with the ESIP 2019 Summer Meeting in Tacoma, WA. This year's theme was “Building Harmony Between Data Semantics and Machine Learning,” and speakers from across the ML world discussed how to integrate semantic technologies to accelerate advances in ML.

The field of Data Semantics (DS) provides an additional ‘layer’ to data beyond traditional information modeling practices by linking logical data structures to relations and context. This structuring is present in datasets such as time-based measurements, satellite observations, in-situ field measurements, and structured data markup in websites, to name a few. DS forms the backbone of search engines and metadata structuring, and can be used to define geospatial relationships. This logical structuring is crucially important to the advancement of ML, as it gives computers a vocabulary on which to accurately build relational knowledge and statistical relationships. One major problem in ML is its dependency on training data to accurately describe real-world events using statistical models. While humans can use semantic hierarchies to differentiate context and relationships (for example, using relative size to understand depth), the algorithms that drive ML depend on data encodings and on how well these data represent the characteristics to be learned. Increasing the amount of knowledge that a dataset conveys by improving data semantics is crucial to creating better workflows within ML.

For ML algorithms to understand the context of a dataset, controlled vocabularies are crucially important. Controlled vocabularies are a way to standardize the meaning of objects entering the ML environment. In the geospatial data context, knowing the connections between data sources, different spatial metrics, or geospatial locations is crucial to making spatial or temporal inferences. Researchers are trained, for example, to understand that gridded data is raster data, or that temperature is synonymous with a thermal reading. Using multiple terms for the same idea, however, can confuse an ML process that has no way of knowing the terms are equivalent. Controlled vocabularies provide a mechanism for standardizing these terms.
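As a rough illustration of the idea, the sketch below folds synonymous labels into a single canonical term before they reach an ML pipeline. The vocabulary entries and the function names (`build_lookup`, `normalize_label`) are hypothetical and were chosen only for this example; a real project would more likely draw its terms from a published community vocabulary.

```python
# Hypothetical controlled vocabulary: each canonical term maps to the
# synonyms that should be treated as equivalent. Entries are illustrative.
CONTROLLED_VOCABULARY = {
    "raster": {"raster", "gridded data"},
    "temperature": {"temperature", "thermal reading"},
}


def build_lookup(vocabulary):
    """Invert the vocabulary into a synonym -> canonical term lookup."""
    lookup = {}
    for canonical, synonyms in vocabulary.items():
        for synonym in synonyms:
            lookup[synonym.lower()] = canonical
    return lookup


def normalize_label(label, lookup):
    """Return the canonical term for a label, or None if it is unknown."""
    # Lowercase and collapse whitespace so "Gridded  Data" still matches.
    key = " ".join(label.lower().split())
    return lookup.get(key)


LOOKUP = build_lookup(CONTROLLED_VOCABULARY)
labels = ["Gridded Data", "thermal reading", "raster", "albedo"]
normalized = [normalize_label(label, LOOKUP) for label in labels]
```

Returning `None` for an unknown label such as "albedo" is a deliberate choice in this sketch: it surfaces terms the vocabulary does not yet cover instead of silently passing them through.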

At the GeoSemantics Symposium, Dr. Simon Cox (CSIRO) spoke on building a common vocabulary for data collected by ecological research sites. These sites often synthesize multiple disciplines and can present problems when a controlled vocabulary is unavailable. For example, atmospheric scientists and ecologists would not use the same description for similar phenomena, but their results must be cataloged in a way that displays their interrelation. Dr. Cox presented a framework for connecting the descriptions of ecological site observations to existing vocabularies. Given the large number of tools that can be used to gather site measurements, a large portion of this work is understanding the parameters of devices and the units they derive. Building semantic relationships between data collection tools, their parameters, and the data created is crucial to drawing accurate inferences from ML.

Ground truthing is an ML practice, commonly applied within Earth science, that benefits from strong data semantics. In ground truthing, datasets are classified to provide the “true” measurement used in training and testing stages. Ground truthing could determine whether an image contains one or more objects of interest (as in reCAPTCHA). More complex examples may involve many different objects, all of which must have well-defined meanings. This can be represented through the use of an ontology. In this example, an ontology would enable one to model (i) identified object(s) within a dataset, (ii) connections and relationships between objects of the same type and of different types in a dataset, and (iii) within a domain.
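One way to picture such an ontology is as a small set of subject-predicate-object statements. The sketch below is a minimal illustration only: the object identifiers, class names, and predicates (`is_a`, `adjacent_to`, `near`, `subclass_of`) are invented for this example, and point (iii) is read here as placing each object in a domain class hierarchy, which is an assumption.

```python
from collections import namedtuple

# A statement in the toy ontology: subject --predicate--> obj.
Triple = namedtuple("Triple", ["subject", "predicate", "obj"])

# Hypothetical labels for one ground-truthed scene.
TRIPLES = [
    # (i) identified objects and their types
    Triple("tree_17", "is_a", "Tree"),
    Triple("tree_42", "is_a", "Tree"),
    Triple("stream_3", "is_a", "Stream"),
    # (ii) relationships between objects of the same and of different types
    Triple("tree_17", "adjacent_to", "tree_42"),
    Triple("tree_17", "near", "stream_3"),
    # (iii) where the types sit within a domain hierarchy (assumed reading)
    Triple("Tree", "subclass_of", "Vegetation"),
    Triple("Stream", "subclass_of", "WaterBody"),
]


def objects_of(subject, predicate, triples):
    """Return every obj linked to `subject` by `predicate`."""
    return [t.obj for t in triples
            if t.subject == subject and t.predicate == predicate]


def domain_classes(instance, triples):
    """List an instance's classes, from most specific to most general."""
    classes = []
    frontier = objects_of(instance, "is_a", triples)
    while frontier:
        current = frontier.pop(0)
        if current not in classes:
            classes.append(current)
            frontier.extend(objects_of(current, "subclass_of", triples))
    return classes
```

With these statements, `domain_classes("tree_17", TRIPLES)` walks from the labeled object up through its class hierarchy, which is the kind of context a bare class label alone would not carry.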

Justin Thomas and Kunal Sengupta (Amazon Web Services – AWS) presented the ground truthing capability within AWS SageMaker. Attendees were directed to set up workgroups, which provide the capability to ground truth datasets. Workgroups can be set up as a workflow for semantic labeling and ground truthing of datasets, which integrates into an ML workflow. The audience at large reached a consensus around how to natively integrate semantics into ML workflows. This will be the focus of future ML research.

Semantic-oriented ground truthing may also be a task that ML itself can improve. In his Data Analytics for Canadian Climate Services (DACCS) presentation, Dr. Jean-Francois Rajotte (CRIM) demonstrated how ML can be used to encode metadata of existing datasets by using Natural Language Understanding (NLU) tools. NLU is employed to extract information from relevant documents, codes, and other available resources. By formalizing the relationships within data, NLU tools have the potential to semantically enhance ML datasets moving forward. However, this process relies on having application-specific ontologies by which dataset classification can be adequately achieved.

Beyond technological challenges and advancements, the democratization of ML was also a topic that was touched on by multiple speakers. When algorithms are created to classify/evaluate human activities that have human consequences, people must have an active way to understand and provide critiques of the results created by machines. This is equally true of the expressive semantics which underpin the datasets in question. Ascribing semantic meaning to data can have broad impacts in the Earth science realm and, for that reason, the community should drive for public access to processes, techniques and datasets for use in ML applications.

The 3rd ESIP GeoSemantics Symposium helped to cultivate and develop the data semantics and ML communities' shared interests in creating harmony that advances the goals of each study. In 2020, both communities will pursue a similar agenda and plan to co-host two similar events within the year. While the Symposium highlighted use cases and advancements across both disciplines, future events need to focus on solid advancement in both fields. These will be strictly hands-on events where groups will work collaboratively on real-world problems. For example, the ML cluster is working on creating a repository of training datasets for Earth science oriented ML applications and the Semantic Technologies Committee will provide the rich data semantics for use in the repository.

—

If you are interested in what the Semantic Technologies Committee is working on, please join our email list at http://wiki.esipfed.org/index.php/Semantic_Technologies and/or attend our meeting on the 4th Tuesday of every month at 4 pm Eastern. (GoToMeeting link: https://global.gotomeeting.com/join/976796333)

Learn more about the Machine Learning Cluster at http://wiki.esipfed.org/index.php/Machine_Learning. You can join the Machine Learning Cluster’s monthly meeting on the 3rd Friday of each month at 12 PM Eastern time (GoToMeeting link: https://global.gotomeeting.com/join/422305101).

More about Zachary: Zachary is a PhD student at North Carolina State University. For his PhD, he plans to develop models to forecast insect disturbance and forest biogeochemical states. This work relies on creating data science tools to integrate climate models, soils inventories, atmospheric deposition data, and forest imputations to generate a multivariate analysis of when and where outbreaks are likely to occur. The data that drives these models comes from many of the organizational members of ESIP. Zachary is working with the Semantic Technologies Committee.

More about Yuhan: Yuhan recently became a postdoctoral research scholar at the North Carolina Institute for Climate Studies of North Carolina State University. His research focus is on leveraging Earth observations and machine learning methods to advance understanding of surface temperature change in recent decades. Yuhan is also interested in science communication and data visualization. For 2019, Yuhan is working with the Machine Learning Cluster.
