Utilizing Ontologies for the Automatic Generation of RDB Infrastructure – Part 1

I’ll never forget how it felt in the first few months working at the Nevada Research Data Center (NRDC). As a newly minted master’s student with only a minor in my chosen field of computer science, I was all but certain that I was there by mistake. I barely knew C++, knew no Python and only a little JavaScript, and I was tasked with reconstructing the front end of a metadata management application. Our chief stakeholder, Scotty Strachan (of the Envirosensing Cluster), informed my associate and me that the current layout of our application (from the iteration prior to my tenure) did not properly reflect the organization of our research sites.

At that time, the Quality Assurance (QA) app enabled the viewing and editing of metadata related to multiple levels of a research site hierarchy. The hierarchy starts with a “Site Network”, which is the parent of the more granular “Sites”, which in turn are parents of “Systems”, and so on down to individual sensors. It was not interactive as a hierarchy in the application. Each categorical tier was listed en masse with no clear association with its parent entity. If you wanted to modify the metadata describing a specific sensor at a specific site, you would just have to know how to pick that single component out of a list of hundreds across dozens of geographically disparate sites.

At the time, I understood the inconvenience of the design but not the underlying thought and planning which went into such an organization. Looking back, it now seems silly to me that the original design did not reflect the reality of how these research sites were organized. But, thinking back to the perspective of first-month-grad-student Connor, I can see that the problem was one of semantics. Neither my associate nor I could work out an effective design for this application because we failed to understand the underlying meaning of the terms involved. We had a vocabulary of terms from outside our own domain, but no clearly codified ontology to tell us what those terms meant: vocabulary without semantics. This is what I hope to change for future NRDC developers.

In this multipart series, I will explore the process of formalizing a semi-formal ontology currently in use by the NRDC (with sufficient detail that others can use my research practically). I will also explore the development of a prototype system which will use this formal ontology to dynamically generate database tables and queryable microservices. Additionally, I will consider possible user-friendly interfaces that enable quick translation of natural language descriptions into ontologies.

Starting off, let’s clarify what an ontology is and what value it offers to the Earth-related information sciences. In the context relevant to ESIP (and likely very familiar to many of you), ontologies are essentially a means to describe sets of objects and concepts in a specific domain. Ontologies also provide a formal structure to describe the relations between these objects and the mechanisms by which they are related. Several standard languages exist to describe ontologies “programmatically”, but the most common and powerful for data storage are RDF (Resource Description Framework), RDFS (RDF Schema) and OWL (Web Ontology Language).

Together, these three languages can describe objects, simple or complex, and the relations between them in standardized, machine-readable syntaxes. Using Uniform Resource Identifiers (URIs), these languages provide a toolset that binds together otherwise incompatible and inaccessible data values which are linked through natural language semantics. At the highest level, OWL, RDFS and RDF help to build a more naturally organized and clearly linked infrastructure connecting resources across the internet. On smaller scales, as with my use case, they can also provide individual projects with a formally defined, standardized, machine-readable breakdown of descriptive metadata defining otherwise hard-to-translate semantics. Such a formally defined ontology can help speed up and standardize the development of related cyberinfrastructure within a project.
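To make the idea of objects and relations identified by URIs a little more tangible, here is a minimal sketch in plain Python of the research site hierarchy written as subject-predicate-object triples, which is the shape RDF statements take. The RDF, RDFS and OWL namespace URIs are the standard W3C ones. The `http://example.org/nrdc#` namespace, the class names and the `hasSite`, `hasSystem` and `hasSensor` properties are assumptions made up for illustration and are not the NRDC's actual ontology.

```python
# Standard W3C namespaces.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"
# Hypothetical namespace: every term under EX is invented for this sketch.
EX = "http://example.org/nrdc#"

# Each statement is a (subject, predicate, object) triple of URIs.
triples = [
    (EX + "SiteNetwork", RDF + "type", OWL + "Class"),
    (EX + "Site", RDF + "type", OWL + "Class"),
    (EX + "System", RDF + "type", OWL + "Class"),
    (EX + "Sensor", RDF + "type", OWL + "Class"),
    # Object properties link one class (domain) to another (range).
    (EX + "hasSite", RDF + "type", OWL + "ObjectProperty"),
    (EX + "hasSite", RDFS + "domain", EX + "SiteNetwork"),
    (EX + "hasSite", RDFS + "range", EX + "Site"),
    (EX + "hasSystem", RDF + "type", OWL + "ObjectProperty"),
    (EX + "hasSystem", RDFS + "domain", EX + "Site"),
    (EX + "hasSystem", RDFS + "range", EX + "System"),
    (EX + "hasSensor", RDF + "type", OWL + "ObjectProperty"),
    (EX + "hasSensor", RDFS + "domain", EX + "System"),
    (EX + "hasSensor", RDFS + "range", EX + "Sensor"),
]


def children_of(class_uri, statements):
    """Return the classes one property step below class_uri."""
    # Properties whose domain is the given class...
    props = {s for s, p, o in statements
             if p == RDFS + "domain" and o == class_uri}
    # ...and the classes those properties point to.
    return sorted(o for s, p, o in statements
                  if p == RDFS + "range" and s in props)


def local_name(uri):
    """Strip the namespace from a URI, leaving the short term name."""
    return uri.rsplit("#", 1)[-1]


if __name__ == "__main__":
    # Walk the hierarchy from the top tier down to individual sensors.
    current = EX + "SiteNetwork"
    while current is not None:
        print(local_name(current))
        below = children_of(current, triples)
        current = below[0] if below else None
```

Because every tier and every parent-child link is an explicit statement, any application can recover the hierarchy by querying the triples instead of hard-coding it. In practice a dedicated RDF library and a serialization such as Turtle would take the place of these hand-built tuples.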

To illustrate the point of how ontologies can help to do this, consider the scenario depicted by the following image:

A diagram depicting three people icons over three different computer icons.

This picture depicts a common scenario: three different software applications are being built, all querying the same data from the same backend. Each application uses the data in a different way, but still requires some common data from a centralized database. To make the example more concrete, let's imagine that all three developers are building a similar hierarchical navigation interface to get to a specific datastream. Without any formally defined ontology, each developer has to create a custom data structure, in their own language, that organizes and relates these hierarchical navigation views to each other. These data structures can have wildly different organizations, from series of lists to complex classes, with no standard names: what one developer might call an “Organizational Tier” another might call a “Navigation Level.”

While better communication between developers can help to mitigate these problems, in research settings this can be very difficult. Oftentimes these three applications are built at different times, long after the graduate developers or postdocs who built the earlier ones have left the institution. Even if the same person develops two of these projects, he or she may decide to change a data structure that proved inefficient last time, or a vocabulary that proved inaccurate. A formalized ontology can help with this significantly.

With a formalized ontology in place, our developers have a much simpler means to develop uniform and interoperable interfaces to the same endpoints. With a single shared ontology, regardless of the programming language used to build an interface, the developers know what APIs and endpoints are available, what they are called and how the queryable data is organized, and they now have a shared vocabulary to develop with. Requiring a clearly defined ontology for development also forces developers and project leaders to sit down and come to a clear consensus on vocabulary, hopefully ensuring that the best possible vocabulary is used to describe commonly occurring entities in the domain being developed for.

Now, the benefits I outlined in this particular example depend on a somewhat cumbersome task. When the backend database tables and the services providing endpoints already exist, we have to configure our ontology around two existing sets of object relations. First, our ontology has to accurately describe the real-world objects we are trying to display in our interface. Second, we must also find a way to describe the abstracted relationships of our data as tables and services. These two requirements can be at odds with the goal of taking a single domain vocabulary and using it to make development easier. So, to better facilitate the use of ontologies as a tool in software development of this variety, we remove the second requirement by starting with an ontology and building our backend from that.
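As a rough illustration of what "starting with an ontology and building our backend from that" could look like, here is a minimal Python sketch that turns a toy class hierarchy into SQL table definitions. It is not the prototype discussed in this series. The `ONTOLOGY` dictionary, the `id` and `name` columns, the table naming scheme and the helper functions are all assumptions chosen for illustration, and SQLite is used only because it ships with Python.

```python
import sqlite3

# Hypothetical, heavily simplified ontology: each class maps to its parent
# class (None for the top tier). Parents are listed before their children.
ONTOLOGY = {
    "SiteNetwork": None,
    "Site": "SiteNetwork",
    "System": "Site",
    "Sensor": "System",
}


def snake(name):
    """Convert a CamelCase class name to a snake_case table name."""
    out = []
    for i, ch in enumerate(name):
        if ch.isupper() and i > 0:
            out.append("_")
        out.append(ch.lower())
    return "".join(out)


def create_table_sql(cls, parent):
    """Build one CREATE TABLE statement for an ontology class."""
    table = snake(cls)
    # Assumed columns: a surrogate key and a display name.
    cols = ["id INTEGER PRIMARY KEY", "name TEXT NOT NULL"]
    if parent is not None:
        # The parent-child relation becomes a foreign key column.
        ptable = snake(parent)
        cols.append(f"{ptable}_id INTEGER NOT NULL REFERENCES {ptable}(id)")
    return f"CREATE TABLE {table} ({', '.join(cols)});"


def generate_schema(ontology):
    """Return CREATE TABLE statements in the ontology's listed order."""
    return [create_table_sql(cls, parent) for cls, parent in ontology.items()]


if __name__ == "__main__":
    # Run the generated statements against an in-memory database.
    conn = sqlite3.connect(":memory:")
    for statement in generate_schema(ONTOLOGY):
        conn.execute(statement)
        print(statement)
    conn.close()
```

The point of the sketch is the direction of the dependency: the table names and foreign keys are derived from the vocabulary, so the ontology only has to describe the real-world objects, and the storage layout follows from it.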

Now, this is where the details of a code-generating prototype come in. Unfortunately, these details will have to wait, as this post is meant to serve as a high-level introduction to the domain of semantics and to the problem and motivation driving this solution. Look forward to a follow-up post with many more details later this month after the ESIP summer meeting!