Outline:

  1. ADC semantic annotation overview/goals
  2. Summary of annotation efforts, to date
    • Table 1: annotation counts, listed by ontology/source
    • Table 2+: attribute information for non-resolvable valueURIs
  3. Summary of most commonly used semantic annotations
    • Figure 1: most commonly used attribute-level annotations across the ADC corpus
    • Figure 2: most common attribute-level annotations used in data packages uploaded pre vs. post-August 2020
    • Table 3+: counts of (i) attribute-level annotations used across the ADC corpus (total, pre-Aug. 2020, post-Aug. 2020), (ii) unique package identifiers each annotation is used in, and (iii) unique authors that used each annotation
  4. Summary of attributes that are not being annotated
    • Figure 3: most common non-annotated attributeNames (unnested, individual tokens)
    • Table 4+: most common non-annotated individual attributeName tokens
  5. Raw data & code

    (a “+” denotes that downloadable data are available below the interactive table)


1. Overview

To improve data discoverability within the Arctic Data Center (ADC), the datateam is beginning to incorporate semantic annotations into the data curation process. Doing so offers a way to standardize the diverse descriptions of data used by researchers across disciplines by attaching terms from controlled vocabularies. Semantic annotations not only provide definitions of concepts, but also capture the relationships between different terms.

Dr. Steven Chong led the first major effort to implement semantic search within the ADC, beginning in 2017, by building out ontological terms pertaining to carbon cycling. You can read more about Dr. Chong’s efforts, the ADC’s semantic search product, and its vision moving forward in this blog post. More recently (as of about August 1, 2020), the datateam began a second push to add semantic annotations to attributes in all data packages incoming to the ADC.

The ADC datateam is currently instructed to add annotations from four main ontologies, as described in the NCEAS Datateam Training.

Here, I explore the current ADC corpus to summarize our progress in implementing semantic search and identify areas for improvement/further consideration.


2. Summary of annotation efforts

How many data packages have annotations?

  • As of October 12, 2020, the Arctic Data Center contains 6142 data packages (NOTE: a data package consists of a publicly-available metadata record, which may be packaged with one or more data files). Of those, 1428 contain data file types with associated attributes (i.e. variables). Currently, 185 data packages have at least one semantically-annotated attribute.
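
As a starting point for reproducing these counts, here is a minimal sketch of querying the ADC’s Solr index from R. The endpoint URL and field names (`formatType`, `obsoletedBy`, `sem_annotation`) are assumptions based on the DataONE search schema, not the exact queries used for this report (those live in the GitHub repository linked in section 5).

```r
# A minimal sketch of querying the ADC Solr index for record counts
library(httr)
library(jsonlite)

solr_endpoint <- "https://arcticdata.io/metacat/d1/mn/v2/query/solr/"

count_records <- function(query) {
  resp <- GET(solr_endpoint, query = list(q = query, rows = 0, wt = "json"))
  stop_for_status(resp)
  fromJSON(content(resp, as = "text", encoding = "UTF-8"))$response$numFound
}

# all current (non-obsoleted) metadata records, i.e. data packages
count_records("formatType:METADATA AND -obsoletedBy:*")

# metadata records containing at least one semantic annotation
count_records("formatType:METADATA AND -obsoletedBy:* AND sem_annotation:*")
```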

How many attributes have been annotated, and when were these added?

  • The majority of attributes in those 185 data packages are annotated (12312/14718); most of these annotations were added during Dr. Chong’s tenure at the ADC (9802/12312), as compared to the 2510/12312 added by the datateam since August 2020.
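
Given a tidied table of attributes (like the one linked in section 5), this pre/post split reduces to a date cutoff. A sketch, assuming hypothetical column names `valueURI` (NA when unannotated) and `dateUploaded`:

```r
library(dplyr)

# toy stand-in for the tidied attribute table extracted from the solr query;
# column names and values here are illustrative, not the report's actual schema
attributes_df <- tibble(
  valueURI     = c("http://purl.dataone.org/odo/ECSO_00001230", NA,
                   "http://purl.dataone.org/odo/ECSO_00001238"),
  dateUploaded = as.Date(c("2018-03-02", "2019-06-15", "2020-09-01"))
)

cutoff <- as.Date("2020-08-01")

attributes_df %>%
  filter(!is.na(valueURI)) %>%                  # keep annotated attributes only
  mutate(era = if_else(dateUploaded < cutoff,
                       "pre-2020-08-01", "post-2020-08-01")) %>%
  count(era)
```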

Which ontologies do the majority of annotations come from?

  • The vast majority of semantic annotations come from The Ecosystem Ontology (ECSO) (12155/12312, or 98.7%). The remaining annotations come from CHEBI, ENVO, and Wikipedia. See details in the table below.
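
These source tallies can be derived from each annotation’s valueURI prefix. A sketch, where the prefix patterns are my assumptions about how each ontology mints its term URIs:

```r
library(dplyr)

# classify each annotation by source based on its valueURI prefix;
# the example URIs below are illustrative
uris <- tibble(valueURI = c(
  "http://purl.dataone.org/odo/ECSO_00001203",
  "http://purl.obolibrary.org/obo/CHEBI_16526",
  "http://purl.obolibrary.org/obo/ENVO_00002006",
  "https://en.wikipedia.org/wiki/Snow"
))

uris %>%
  mutate(ontology = case_when(
    grepl("odo/ECSO_",  valueURI) ~ "ECSO",
    grepl("obo/CHEBI_", valueURI) ~ "CHEBI",
    grepl("obo/ENVO_",  valueURI) ~ "ENVO",
    grepl("wikipedia",  valueURI) ~ "Wikipedia",
    TRUE                          ~ "other"
  )) %>%
  count(ontology, sort = TRUE)
```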

Which annotations are non-resolvable?

Three valueURIs in the ADC corpus do not resolve. See additional details regarding these non-resolvable URIs in Table 2, below.

NOTE: You can download Table 2 as a .csv file here.
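
To flag non-resolvable valueURIs programmatically, one option is to issue an HTTP request to each URI and inspect the status code. A minimal sketch (the second example URI is hypothetical; note that some servers reject HEAD requests, so a GET fallback may be needed in practice):

```r
library(httr)
library(purrr)

# check whether each valueURI resolves; a 200 status (after redirects) is
# treated as resolvable, anything else (or a connection error) is not
check_uri <- function(uri) {
  status <- tryCatch(status_code(HEAD(uri, timeout(10))),
                     error = function(e) NA_integer_)
  data.frame(valueURI = uri, status = status,
             resolvable = isTRUE(status == 200))
}

uris <- c("http://purl.obolibrary.org/obo/ENVO_00002006",  # example term URI
          "http://example.org/no-such-term")               # hypothetical non-resolver

map_dfr(uris, check_uri)
```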


3. Which semantic annotations are most commonly used at the attribute level?

The most common semantic annotations used across all ADC metadata records are visualized in Figure 1, below (for the sake of space, only terms used more than 20 times are included in Fig.1). These include terms such as soil temperature (used a total of 439 times), relative species abundance (used 305 times), and air temperature (used 283 times). You can explore (and download) the associated data file containing all semantic annotations (i.e. not just those used >20 times) currently included in ADC metadata records in Table 3. Dropping the valueURI into your web browser will take you to the semantic annotation, where you can learn more about its description and relationships to other terms.
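
The Figure 1 workflow boils down to tallying annotation labels and keeping those used more than 20 times. A sketch with toy data (the column name `label` is an assumption; the real extract is linked in section 5):

```r
library(dplyr)
library(ggplot2)

# toy annotation labels standing in for the full corpus extract
annotations_df <- tibble(
  label = sample(c("soil temperature", "air temperature", "latitude"),
                 size = 500, replace = TRUE)
)

annotations_df %>%
  count(label, sort = TRUE) %>%
  filter(n > 20) %>%                               # mirror Fig.1's >20 cutoff
  ggplot(aes(x = n, y = reorder(label, n))) +
  geom_col() +
  labs(x = "times used", y = "semantic annotation")
```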

I’ve broken this down a bit further into annotations added prior to August 1, 2020 (i.e. those added as part of Dr. Chong’s efforts; Fig.2a) vs. those added on or after August 1, 2020 (i.e. the more recent additions made by the ADC datateam since incorporating annotations into the data curation workflow; Fig.2b). For example, soil temperature (the most frequently used annotation overall; see Fig.1) was primarily assigned to attributes during Dr. Chong’s efforts and has been used less frequently in the more recent annotation work. You can find the corresponding values in Table 3, in the pre-2020-08-01 counts and post-2020-08-01 counts columns.

NOTE: You can download Table 3 as a .csv file here.


4. Which attributes are not getting annotated?

While annotating each and every attribute within a data package is the ultimate goal, doing so is currently a time-intensive process for the datateam. As such, datateam members target the most “semantically-important” attributes to annotate and leave “less-important” attributes (terms that are less likely to be searched for; e.g. datetime, latitude, longitude) unannotated.

Recall that there are 185 ADC data packages containing annotations, and across those packages, a total of 14718 attributes. While part 3 explores the 12312 attributes that have been semantically annotated, here I summarize the remaining 2406 attributes that did not receive annotations.
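
Isolating those remaining attributes is a simple filter on the tidied table, again assuming a hypothetical `valueURI` column that is NA when no annotation was assigned:

```r
library(dplyr)

# toy stand-in for the tidied attribute table (column names are assumptions)
attributes_df <- tibble(
  attributeName = c("soil_temp", "depth", "site_id"),
  valueURI      = c("http://purl.dataone.org/odo/ECSO_00001230", NA, NA)
)

# attributes with no assigned annotation (2406 rows in the real corpus)
non_annotated <- attributes_df %>% filter(is.na(valueURI))
non_annotated
```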

A goal is to assess whether these non-annotated attributes are (a) terms that datateam members are intentionally skipping for the sake of time (e.g. “less semantically important” attributes), (b) terms that really should have an annotation but were skipped accidentally, or (c) terms that were skipped because there is currently no appropriate semantic annotation to describe them.

The most common non-annotated attributeNames (individual tokens) are visualized in Figure 3 and explored in Table 4, below. IMPORTANT NOTE: the attributeNames on the y-axis are actually unnested individual tokens (singular words), meaning all attributeNames were separated into individual words (e.g. sep = " ") during the text mining process. For example, “depth” (the second most common term; counts = 71) may exist in the ADC corpus as attributeName = “depth”, “soil depth”, “snow depth”, etc. Parsing these will require some further analysis.
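
The unnesting step described above can be done with the tidytext package. A minimal sketch (unnest_tokens() lowercases and splits on non-word characters by default, which is one plausible implementation of the splitting described):

```r
library(dplyr)
library(tidytext)

# split multi-word attributeNames into individual tokens, then tally
attribute_names <- tibble(
  attributeName = c("depth", "soil depth", "snow depth")
)

attribute_names %>%
  unnest_tokens(output = token, input = attributeName) %>%
  count(token, sort = TRUE)
# "depth" is counted 3 times here despite coming from three distinct attributeNames
```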

I’ve manually assigned terms to some general categories (see legend). Location terms (e.g. “position”, “latitude”, “top”, “site”, etc.), temporal terms (e.g. “date”, “phase”), and QC/confidence terms are likely being intentionally skipped for the sake of time (see instructions in Datateam Training Part 4.8.2 #1), whereas measurements (e.g. “depth”, “height”) and environmental materials (e.g. “soil”, “snow”, “ice”) likely have an appropriate annotation match (ECSO & ENVO) and could be annotated.
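
A sketch of how such a manual categorization might be encoded; the token-to-category lookup below is illustrative only, not the full mapping behind Figure 3’s legend:

```r
library(dplyr)

# assign each token to a hand-curated category (partial, hypothetical lookup)
tokens <- tibble(token = c("latitude", "date", "depth", "soil", "flag"))

tokens %>%
  mutate(category = case_when(
    token %in% c("position", "latitude", "longitude", "top", "site") ~ "location",
    token %in% c("date", "time", "phase")                            ~ "temporal",
    token %in% c("flag", "qc", "confidence")                         ~ "QC/confidence",
    token %in% c("depth", "height")                                  ~ "measurement",
    token %in% c("soil", "snow", "ice")                              ~ "environmental material",
    TRUE                                                             ~ "uncategorized"
  ))
```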

NOTE: You can download Table 4 as a .csv file here.

Additionally, I’ve asked the datateam to add any attributes for which they cannot find an appropriate annotation to this Google Drive Sheet. Entries are accumulating very slowly (which is hopefully a good sign that most attributes have an appropriate annotation match).


5. Raw data & code

  • Raw data:
    • solr query (2020-10-12) is downloadable here
    • tidied attribute/annotation information extracted from 2020-10-12 solr query is downloadable here
  • GitHub Repository with associated code, analyses, and data (this also includes analyses and data not explicitly covered in this report): samanthacsik/NCEAS-DF-semantic-annoations-review