This vignette explains the overall usage of the rosettaR
package, including its core functions and workflows. At its heart,
rosettaR is a templating system with extra window dressing
and rules. This document will walk through the basic usage, starting
with a single template example and expanding to multi-template data
frames.
You can see the original pre-print for the conceptual background to Rosetta Statements: Rosetta Statements: Simplifying FAIR Knowledge Graph Construction with a User-Centered Approach.
Here are some key terms:
Slot - Think of a slot as a blank to be filled in a Mad Lib. These are denoted in the templates using Jinja-style notation, which means that slots will look like this:
{{ object }}Template - A template is a string with slots for particular data types. For example:
{{ object }} has a {{ quality }} of {{ value }} {{ unit }}Statement - A statement is a plain language sentence that contains entries for slots. For example:
Apple X has a weight of 241.68 grams
Templates are made up of slots and semantic information that links
them together. The slots are filled by statements, which can be parsed
by rosettaR into any different data structure.
Single template-statement pair
There are two variables saved in the package for you to work with. This template-statement pair describes the weight of an apple. Let’s declare them here so we can see what is inside:
print(apple_template)
#> [1] "{{ object }} has a {{ quality }} of {{ value }} {{ unit }}"
print(apple_statement)
#> [1] "Apple X has a weight of 241.68 grams"The idea behind rosettaR is that one may want to model
the knowledge contained in this sentence in multiple ways. The
template-statement pair constitutes a core piece of information
(sometimes called a “semantic anchor”) that can be transformed into
multiple representations.
Here we can turn it into a data frame:
knitr::kable(rosetta_format(apple_statement, apple_template))| object | quality | value | unit |
|---|---|---|---|
| Apple X | weight | 241.68 | grams |
That’s a pretty straightforward usage, but we can get more complicated with it. Here we can transform the statement into a CSV, where the slots become column headers and the values go in the columns:
rosetta_format(apple_statement, apple_template,
"Object,Quality,Value,Unit\n
{{ object }},{{ quality }},{{ value }},{{ unit }}")
#> [1] "Object,Quality,Value,Unit\n\n Apple X,weight,241.68,grams"We’re still in pretty simple territory here, so let’s get a little more complex. The original idea behind Rosetta Statements was to make it easier to create knowledge graphs in RDF. So here is an example of an RDF graph that describes the weight of an apple:
apple_rdf <- '@prefix ex: <http://example.org/instance/> .
@prefix ncit: <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#> .
@prefix pato: <http://purl.obolibrary.org/obo/PATO_> .
@prefix iao: <http://purl.obolibrary.org/obo/IAO_> .
@prefix obi: <http://purl.obolibrary.org/obo/OBI_> .
@prefix ro: <http://purl.obolibrary.org/obo/RO_> .
@prefix uo: <http://purl.obolibrary.org/obo/UO_> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
############################################################
## OBJECT: the particular apple
############################################################
ex:apple
rdf:type ncit:apple ;
rdfs:label "{{ object }}" ;
ro:has_quality ex:apple{{ quality }} .
############################################################
## QUALITY: weight of the apple
############################################################
ex:apple{{ quality }}
rdf:type pato:{{ quality }} ; # PATO:weight
iao:is_about ex:apple ;
iao:is_quality_measured_as ex:apple{{ quality }}Datum .
############################################################
## MEASUREMENT DATUM
############################################################
ex:apple{{ quality }}Datum
rdf:type iao:0000109 ; # IAO:scalar measurement datum
obi:has_value_specification ex:apple{{ quality }}ValueSpec .
############################################################
## VALUE SPECIFICATION
############################################################
ex:apple{{ quality }}ValueSpec
rdf:type obi:0001938 ; # OBI:scalar value specification
iao:has_measurement_value "{{ value }}"^^xsd:double ;
iao:has_measurement_unit_label uo:0000021 .
############################################################
## UNIT
############################################################
uo:0000021
rdf:type uo:{{ unit }} .'Describing what exactly this TTL does is beyond the scope of this vignette, but if you are reading this it is likely you have some idea about RDF/OWL/etc. So let’s feed that string into our workflow and see what happens:
cat(rosetta_format(apple_statement, apple_template, apple_rdf))
@prefix ex: <http://example.org/instance/> .
@prefix ncit: <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#> .
@prefix pato: <http://purl.obolibrary.org/obo/PATO_> .
@prefix iao: <http://purl.obolibrary.org/obo/IAO_> .
@prefix obi: <http://purl.obolibrary.org/obo/OBI_> .
@prefix ro: <http://purl.obolibrary.org/obo/RO_> .
@prefix uo: <http://purl.obolibrary.org/obo/UO_> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
############################################################
## OBJECT: the particular apple
############################################################
ex:apple
rdf:type ncit:apple ;
rdfs:label "Apple X" ;
ro:has_quality ex:appleweight .
############################################################
## QUALITY: weight of the apple
############################################################
ex:appleweight
rdf:type pato:weight ; # PATO:weight
iao:is_about ex:apple ;
iao:is_quality_measured_as ex:appleweightDatum .
############################################################
## MEASUREMENT DATUM
############################################################
ex:appleweightDatum
rdf:type iao:0000109 ; # IAO:scalar measurement datum
obi:has_value_specification ex:appleweightValueSpec .
############################################################
## VALUE SPECIFICATION
############################################################
ex:appleweightValueSpec
rdf:type obi:0001938 ; # OBI:scalar value specification
iao:has_measurement_value "241.68"^^xsd:double ;
iao:has_measurement_unit_label uo:0000021 .
############################################################
## UNIT
############################################################
uo:0000021
rdf:type uo:grams .Processing multiple statements
Doing a single statement is fairly trivial. The real power of Rosetta Statements comes when you have a lot of statements using multiple templates. In order to make this work, there needs to be a library of templates.
This “library” is a simple data frame, but it has to have specific headers. We have built a command that creates one (or loads one from a properly formatted CSV)
# Create an empty library
empty_lib <- init_library()
knitr::kable(empty_lib)| TemplateID | templateText |
|---|
Now let’s load a real set of templates that we have included as an
example using the init_library() function. This function
verifies that the template file is formatted properly:
templates <- init_library(system.file("extdata/apple_templates.csv",
package="rosettaR"))
knitr::kable(templates)| TemplateID | templateText | metaTemplateID |
|---|---|---|
| 1 | {{ object }} has a {{ quality }} of {{ value }} {{ unit }} | NA |
| 2 | {{ object }} was grown in {{ location }} and picked {{ date_picked }} | NA |
| 3 | {{ object }} has the effect of making a human {{ magical_effect }} | NA |
Next, let’s load some statements. Note that these are just plain text strings:
statements <- read.csv(system.file("extdata/apple_statements.csv",
package="rosettaR"))
head(statements)
#> TemplateID statement
#> 1 1 Apple X has a weight of 2332.4 grams
#> 2 1 Apple Y has a weight of 23 stone
#> 3 2 Apple X was grown in Germany and picked 2025
#> 4 2 Apple Y was grown in Canada and picked January 4, 2025
#> 5 3 Apple X has the effect of making a human happy
#> 6 3 Apple Y has the effect of making a human sleepyWe can now use the rosetta_match() function. This
“router” function automatically iterates through your statements, finds
the matching template in your library, and extracts the data.
Unlike the simple rosetta_format which returns a wide dataframe
(columns for variables), rosetta_match returns a Long
Dataframe (Entity-Attribute-Value). This ensures that we don’t get a
“sparse matrix” full of NAs when mixing different templates.
results <- rosetta_match(statements, templates)
knitr::kable(head(results, 10))| statement_id | statement_text | template_id | variable | value |
|---|---|---|---|---|
| 1 | Apple X has a weight of 2332.4 grams | 1 | object | Apple X |
| 1 | Apple X has a weight of 2332.4 grams | 1 | quality | weight |
| 1 | Apple X has a weight of 2332.4 grams | 1 | value | 2332.4 |
| 1 | Apple X has a weight of 2332.4 grams | 1 | unit | grams |
| 2 | Apple Y has a weight of 23 stone | 1 | object | Apple Y |
| 2 | Apple Y has a weight of 23 stone | 1 | quality | weight |
| 2 | Apple Y has a weight of 23 stone | 1 | value | 23 |
| 2 | Apple Y has a weight of 23 stone | 1 | unit | stone |
| 3 | Apple X was grown in Germany and picked 2025 | 2 | object | Apple X |
| 3 | Apple X was grown in Germany and picked 2025 | 2 | location | Germany |
This output structure is ideal for building knowledge graphs, as every row corresponds to a single semantic fact (Subject-Predicate-Object).
Advanced templating
Real-world data is rarely as clean as “Apple 1 has a weight of 100
g”. Often, statements include optional details or citations.
rosettaR supports advanced template syntax to handle these
cases.
Optional information in templates
You can mark parts of a template as optional by enclosing them in
square brackets [...]. If the text inside the brackets is
missing from the statement, the parser will simply skip it (and return
empty values for those slots) rather than throwing an error.
# Template with optional 'region' block
tmpl <- "{{ city }} is located in {{ country }} [in the region of {{ region }}]"
# Statement WITH region
s1 <- "Kitchener is located in Canada in the region of Waterloo"
knitr::kable(rosetta_format(s1, tmpl))| city | country | region |
|---|---|---|
| Kitchener | Canada | Waterloo |
# Statement WITHOUT region
s2 <- "Kitchener is located in Canada"
knitr::kable(rosetta_format(s2, tmpl))| city | country | region |
|---|---|---|
| Kitchener | Canada |
Attribution and sources
A key requirement for scientific knowledge graphs is attribution. You can include citation logic directly in your templates. This makes it explicit which statement is associated with which source.
# Template explicitly asking for a DOI
cited_tmpl <- "{{ city }} is in {{ country }} according to {{ doi }}"
# Input statement
stmt <- "Kitchener is in Canada according to 10.123/example"
# Result captures both the fact AND the source
knitr::kable(rosetta_format(stmt, cited_tmpl))| city | country | doi |
|---|---|---|
| Kitchener | Canada | 10.123/example |
Validation
Extracting data is only half the battle; we also need to ensure it is valid. rosettaR integrates with LinkML (via Python) to validate your extracted data against a schema.
This ensures, for example, that an extracted “Age” is actually a number, or that a “Country” matches a specific list of allowed terms.
Defining a schema
LinkML schemas are defined in YAML. Here is a simple example schema for our Apple data:
schema_yaml <- "
id: [https://example.org/apple-schema](https://example.org/apple-schema)
name: AppleSchema
imports:
- linkml:types
default_range: string
classes:
AppleObservation:
attributes:
object:
required: true
value:
range: float # Value must be a number!
unit:
range: string
"
# Save to a temporary file for this example
schema_file <- tempfile(fileext = ".yaml")
writeLines(schema_yaml, schema_file)Validating data
We can take the output from rosetta_format and check it
against this schema.
# 1. Good Data
good_data <- data.frame(object = "Apple A", value = 150.5, unit = "g")
res_good <- rosetta_validate(good_data,
schema_file,
target_class = "AppleObservation")
print(paste("Good Data is Valid:", res_good$ok))
# 2. Bad Data (Value is a string 'Heavy', not a float)
bad_data <- data.frame(object = "Apple B", value = "Heavy", unit = "g")
res_bad <- rosetta_validate(bad_data,
schema_file,
target_class = "AppleObservation")
print(paste("Bad Data is Valid:", res_bad$ok))
if (!res_bad$ok) {
print(res_bad$issues)
}This validation step acts as a powerful gatekeeper, ensuring that only high-quality, structured data enters your final knowledge graph. # Under construction - much more to come! To do:
- Add slot validation functionality
- Document multiple template/statement pairs and template library functionality
- Fix regex in Rosetta_format
- Fix rosetta_match function