Raw data & code
Downloadable data are available below the interactive table.
In order to improve data discoverability within the Arctic Data Center (ADC), the datateam is beginning to incorporate semantic annotations into the data curation process. Doing so offers a way to standardize the diverse descriptions of data used by researchers across disciplines by attaching terms from controlled vocabularies. Semantic annotations not only provide definitions of concepts, but also show the relationships between different terms.
Dr. Steven Chong led the first major effort to implement semantic search within the ADC, beginning in 2017, by building out ontological terms pertaining to carbon cycling. You can read more about Dr. Chong’s efforts, the ADC’s semantic search product, and its vision moving forward in this blog post. More recently (as of about August 1, 2020), the datateam began making a second push to add semantic annotations to attributes for all incoming data packages to the ADC.
The ADC datateam is currently instructed to add annotations from four main ontologies (the following text was borrowed from the NCEAS Datateam Training):
Here, I explore the current ADC corpus to summarize our progress in implementing semantic search and identify areas for improvement/further consideration.
How many datapackages have annotations?
How many attributes have been annotated, and when were these added?
Which ontologies do the majority of annotations come from?
Which annotations are non-resolvable?
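As a rough sketch of how the "which ontology" question can be answered programmatically, each annotation's valueURI can be classified by its PURL prefix. The prefixes below are my assumptions based on common ECSO/ENVO/NCBITaxon PURL patterns, and the helper names are illustrative, not part of any ADC tooling:

```python
# Assumed valueURI prefixes for ontologies the datateam draws from;
# verify these against the actual URIs in the corpus before relying on them.
ONTOLOGY_PREFIXES = {
    "http://purl.dataone.org/odo/ECSO_": "ECSO",
    "http://purl.obolibrary.org/obo/ENVO_": "ENVO",
    "http://purl.obolibrary.org/obo/NCBITaxon_": "NCBITaxon",
}

def classify_uri(value_uri):
    """Return the ontology a semantic annotation URI belongs to, or 'other'."""
    for prefix, name in ONTOLOGY_PREFIXES.items():
        if value_uri.startswith(prefix):
            return name
    return "other"

def count_by_ontology(value_uris):
    """Tally annotation URIs by ontology of origin."""
    counts = {}
    for uri in value_uris:
        key = classify_uri(uri)
        counts[key] = counts.get(key, 0) + 1
    return counts
```

Running `count_by_ontology` over every valueURI in the corpus would give the per-ontology breakdown directly.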
See additional details regarding these three non-resolvable URIs in Table 2, below:
The most common semantic annotations used across all ADC metadata records are visualized in Figure 1, below (for the sake of space, only terms used more than 20 times are included in Fig.1). These include terms such as soil temperature (used a total of 439 times), relative species abundance (used 305 times), and air temperature (used 283 times). You can explore (and download) the associated data file containing all semantic annotations (i.e. not just those used >20 times) currently included in ADC metadata records in Table 3. Dropping the valueURI into your web browser will take you to the semantic annotation, where you can learn more about its description and relationship to other terms.
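The Figure 1 cutoff (only terms used more than 20 times) can be reproduced with a simple frequency count. This is a minimal sketch, the function name is mine, and the input is assumed to be a flat list of annotation labels, one entry per use:

```python
from collections import Counter

def frequent_terms(annotations, min_count=20):
    """Return (label, count) pairs for annotations used more than
    min_count times, most frequent first -- the same cutoff as Fig.1."""
    counts = Counter(annotations)
    return [(label, n) for label, n in counts.most_common() if n > min_count]
```

For example, `frequent_terms(["soil temperature"] * 25 + ["air temperature"] * 5)` keeps only soil temperature, since air temperature falls under the 20-use cutoff.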
I’ve broken this down a bit further into annotations added prior to August 1, 2020 (i.e. those added as part of Dr. Chong’s efforts; Fig.2a) vs. those added on or after August 1, 2020 (i.e. the more recent additions made by the ADC datateam since incorporating annotations into the data curation workflow; Fig.2b). For example, soil temperature (the most frequently used annotation overall; see Fig.1) was primarily assigned to attributes during Dr. Chong’s efforts and has been used less frequently in the more recent annotation efforts. You can find the corresponding values in Table 3, in the pre-2020-08-01 counts and post-2020-08-01 counts columns.
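The pre/post split behind Figure 2 and the two Table 3 count columns amounts to bucketing each annotation use by its date relative to the 2020-08-01 cutoff. A minimal sketch, assuming the input is (term, date_added) pairs (the field names and function are illustrative, not the actual pipeline):

```python
from datetime import date

CUTOFF = date(2020, 8, 1)  # start of the datateam's recent annotation push

def split_counts(records):
    """Count uses of each term before vs. on/after the 2020-08-01 cutoff.

    records: iterable of (term, date_added) pairs.
    Returns {term: {"pre": n, "post": m}}.
    """
    out = {}
    for term, added in records:
        bucket = "pre" if added < CUTOFF else "post"
        entry = out.setdefault(term, {"pre": 0, "post": 0})
        entry[bucket] += 1
    return out
```

Dates exactly on 2020-08-01 land in the "post" bucket, matching the "on or after" wording above.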
While annotating every attribute within a data package is the ultimate goal, doing so is currently a time-intensive process for the datateam. As such, datateam members target the most “semantically important” attributes to annotate and leave “less important” attributes (terms that are less likely to be searched on, e.g. datetime, latitude, longitude) unannotated.
Recall that there are 185 ADC data packages containing annotations, and across those packages, a total of 14718 attributes. While part 3 explores the 12312 attributes that have been semantically annotated, here I summarize the remaining 2406 attributes that did not receive annotations.
A goal is to assess if these non-annotated attributes are (a) terms that datateam members are intentionally skipping for sake of time (e.g. “less semantically important” attributes), (b) terms that really should have an annotation and got skipped accidentally, or (c) terms that got skipped because there is currently no appropriate semantic annotation to describe them.
The most common non-annotated attributeNames (individual tokens) are visualized in Figure 3 and explored in Table 4, below. IMPORTANT NOTE: the attributeNames on the y-axis are actually unnested individual tokens (singular words), meaning all attributeNames were separated into individual words (e.g. using sep = " ") during the text mining process. For example, “depth” (the second most common token; counts = 71) may exist in the ADC corpus as attributeName = “depth”, “soil depth”, “snow depth”, etc. Parsing these will require some further analysis.
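The unnesting step described above (splitting each attributeName on spaces into single-word tokens before counting) can be sketched as follows; lowercasing is my own assumption and may differ from the actual text-mining pipeline:

```python
from collections import Counter

def token_counts(attribute_names):
    """Split each attributeName on whitespace and count individual tokens,
    so 'soil depth' contributes one count each to 'soil' and 'depth'."""
    tokens = []
    for name in attribute_names:
        tokens.extend(name.lower().split())
    return Counter(tokens)
```

This is why a token like “depth” can outnumber any single attributeName: `token_counts(["depth", "soil depth", "snow depth"])` counts “depth” three times but “soil” and “snow” once each.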
I’ve manually assigned terms to some general categories (see legend). Location terms (e.g. “position”, “latitude”, “top”, “site”), temporal terms (e.g. “date”, “phase”), and QC/confidence terms are likely being intentionally skipped for the sake of time (see instructions in Data Team Training Part 4.8.2 #1), whereas measurements (e.g. “depth”, “height”) and environmental materials (e.g. “soil”, “snow”, “ice”) likely have an appropriate annotation match (ECSO & ENVO) and could be annotated.
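If useful, the manual categorization could eventually be encoded as a simple token-to-category lookup. The sketch below mirrors only the example tokens named above; it is an illustrative subset, not the full legend:

```python
# Token -> category map mirroring the manual groupings described above;
# an illustrative subset only, not the complete assignment.
TOKEN_CATEGORIES = {
    "position": "location", "latitude": "location",
    "top": "location", "site": "location",
    "date": "temporal", "phase": "temporal",
    "depth": "measurement", "height": "measurement",
    "soil": "environmental material", "snow": "environmental material",
    "ice": "environmental material",
}

def categorize(token):
    """Return the assigned category for a token, or 'uncategorized'."""
    return TOKEN_CATEGORIES.get(token, "uncategorized")
```

Tokens falling through to "uncategorized" would be the candidates for cases (b) and (c) above: accidentally skipped, or lacking an appropriate annotation.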
Additionally, I’ve asked the datateam to add any attributes that they cannot find an appropriate annotation for to this Google Drive sheet. Entries are accumulating very slowly (hopefully a good sign that most attributes have an appropriate annotation match).