(series: best practices and solution choices for knowledge graphs)
The question addressed here is which domain to use when defining URIs and how to make the dereferencing of those URIs expressive. We explore the consequences of the close tie between URIs and what can be done at the dereferencing level. We also look at how to achieve this expressiveness while still pursuing the goals of Linked Open Data (LOD).
Let's start with a LOD recommendation: link knowledge graphs together by reusing URIs already defined and used in other datasets. Take a fact that we would like to express in our dataset on works of art, using the prefix http://exemple.org/kg/, abbreviated here as my: . The fact is: (work X, has as creator, Victor Hugo). A naive implementation of this recommendation would lead us to:
- define a URI in our domain for 'work X' as my:5212,
- look for an existing URI to express the property 'has as creator'; a quick look at existing vocabularies leads to Dublin Core, which defines the property dcterms:creator; alternatively, a search for 'creator' on the Linked Open Vocabularies (LOV) site suggests that dcterms:creator is the most appropriate choice,
- find a URI for Victor Hugo; here again, we quickly find general-purpose knowledge graphs such as DBpedia, Wikidata and YAGO; for example, Wikidata offers a search interface (https://www.wikidata.org/wiki/Wikidata:Main_Page) where a search for 'Victor Hugo' quickly leads to https://www.wikidata.org/entity/Q535 (strictly speaking, to https://www.wikidata.org/wiki/Q535, which is the HTML page associated with the entity https://www.wikidata.org/entity/Q535).
This would give us the triple:
my:5212 dcterms:creator wd:Q535
(note: some of the prefixes used can be explained with the https://prefix.cc/ service)
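To make the abbreviated notation concrete, here is a minimal Python sketch that expands the prefixed names of the triple above into full URIs and prints the triple as an N-Triples line. The namespace URIs chosen for dcterms: and wd: are the commonly published ones and are an assumption here (the post does not spell them out); the helper names expand and to_ntriples are purely illustrative.

```python
# Prefix table. "my" is the post's example prefix; the dcterms and wd
# namespace URIs are assumptions (the commonly published ones).
PREFIXES = {
    "my": "http://exemple.org/kg/",
    "dcterms": "http://purl.org/dc/terms/",
    "wd": "http://www.wikidata.org/entity/",
}


def expand(curie: str) -> str:
    """Turn a prefixed name such as 'wd:Q535' into a full URI."""
    prefix, _, local = curie.partition(":")
    if prefix not in PREFIXES:
        raise KeyError(f"unknown prefix: {prefix!r}")
    return PREFIXES[prefix] + local


def to_ntriples(s: str, p: str, o: str) -> str:
    """Serialize one triple of prefixed names as an N-Triples line."""
    return " ".join(f"<{expand(term)}>" for term in (s, p, o)) + " ."


triple_line = to_ntriples("my:5212", "dcterms:creator", "wd:Q535")
print(triple_line)
```

Only the subject of the resulting line lives in our own domain; the predicate and the object both point to other domains.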
The work has an identifier in our knowledge graph, a property taken from a vocabulary defined elsewhere, and a value for this property that is also defined elsewhere. We have thus created links with other datasets.
What are the drawbacks of this naive approach?
First of all, it introduces points of weakness into our graph. Suppose that using the graph depends on properties of the entity wd:Q535, at the very least its label. If the external dataset became temporarily unavailable or were substantially modified, the use of our graph would be compromised. The more external datasets our graph references, the more fragile it becomes.
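The fragility can be illustrated with a small Python sketch. The function fetch_external_label is a hypothetical stand-in for a network call to the external dataset; here it always fails, to simulate an outage. The function display_label and the local_labels dictionary are likewise illustrative names, not part of any library.

```python
WD_HUGO = "http://www.wikidata.org/entity/Q535"


def fetch_external_label(uri):
    """Hypothetical stand-in for a request to an external dataset.

    It always raises, simulating the dataset being unavailable.
    """
    raise ConnectionError(f"cannot reach {uri}")


def display_label(uri, local_labels, fetch=fetch_external_label):
    """Return a human-readable label for a URI.

    Prefer a label stored in our own graph, then try the external
    dataset, and fall back to the bare URI if that fails.
    """
    if uri in local_labels:
        return local_labels[uri]
    try:
        return fetch(uri)
    except ConnectionError:
        return uri


# Without any local information, an outage leaves us with the raw URI.
print(display_label(WD_HUGO, {}))

# With a locally stored label, the outage has no visible effect.
print(display_label(WD_HUGO, {WD_HUGO: "Victor Hugo"}))
```

In this sketch, every piece of information that is only available externally is a place where the display can degrade.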
Next, regarding dereferencing, the triple above is not enough to display a page that describes the entity in a way a person can understand. We could add the following triple:
wd:Q535 rdfs:label "Victor Hugo"
This lets us display on the page a link whose text is "Victor Hugo" and whose target is "http://www.wikidata.org/entity/Q535". But clicking this link sends us to the corresponding Wikidata page, with its own presentation and no link back to our graph. More generally, if we want rich dereferencing that contains internal links to our graph, we have to define a URI in our own domain for Victor Hugo, for example my:Victor_Hugo.
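The dereferencing step described above can be sketched in Python as follows. The graph is modeled as a plain set of (subject, predicate, object) tuples, and the functions label_of and render_rows are hypothetical helpers written for this illustration; a real dereferencing service would be more elaborate. Treating any object that starts with "http" as a URI is a simplifying assumption.

```python
from html import escape

MY = "http://exemple.org/kg/"
WD = "http://www.wikidata.org/entity/"
DCTERMS_CREATOR = "http://purl.org/dc/terms/creator"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# The two triples discussed above, as (subject, predicate, object) tuples.
GRAPH = {
    (MY + "5212", DCTERMS_CREATOR, WD + "Q535"),
    (WD + "Q535", RDFS_LABEL, "Victor Hugo"),
}


def label_of(uri, graph):
    """Return the rdfs:label of a URI if the graph has one, else the URI."""
    for s, p, o in graph:
        if s == uri and p == RDFS_LABEL:
            return o
    return uri


def render_rows(subject, graph):
    """Build one HTML line per outgoing triple of the subject."""
    rows = []
    for s, p, o in sorted(graph):
        if s != subject:
            continue
        if o.startswith("http"):  # simplification: treat as a URI
            value = f'<a href="{escape(o)}">{escape(label_of(o, graph))}</a>'
            if not o.startswith(MY):
                value += " (external)"  # the link leaves our graph
        else:
            value = escape(o)
        rows.append(f"{escape(label_of(p, graph))}: {value}")
    return rows


rows = render_rows(MY + "5212", GRAPH)
print("\n".join(rows))
```

The link text is readable thanks to the added label, but the link target is outside our domain, so the page reached by clicking it is not under our control.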
But if we want to keep external links in order to comply with the LOD principles, we have to add a triple that provides this link; we then have:
my:5212 dcterms:creator my:Victor_Hugo
my:Victor_Hugo owl:sameAs wd:Q535
my:Victor_Hugo rdfs:label "Victor Hugo"
In fact, we could very well start with our own URIs and then, later, add one or more sameAs links:
my:5212 dcterms:creator my:Victor_Hugo
my:Victor_Hugo rdfs:label "Victor Hugo"
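This two-step approach can be sketched in Python: first a graph that only uses our own entity URIs, then a later pass that adds owl:sameAs links. The graph is again a plain set of tuples, and link_to_external and external_equivalents are hypothetical helper names used only for this illustration.

```python
MY = "http://exemple.org/kg/"
WD = "http://www.wikidata.org/entity/"
DCTERMS_CREATOR = "http://purl.org/dc/terms/creator"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Step 1: the graph only uses entity URIs from our own domain.
graph = {
    (MY + "5212", DCTERMS_CREATOR, MY + "Victor_Hugo"),
    (MY + "Victor_Hugo", RDFS_LABEL, "Victor Hugo"),
}


def link_to_external(graph, local_uri, external_uris):
    """Return a new graph with one owl:sameAs triple per external URI."""
    return graph | {(local_uri, OWL_SAME_AS, ext) for ext in external_uris}


def external_equivalents(graph, local_uri):
    """List the external URIs declared equivalent to a local URI."""
    return sorted(
        o for s, p, o in graph if s == local_uri and p == OWL_SAME_AS
    )


# Step 2, possibly much later: add the link to Wikidata.
linked = link_to_external(graph, MY + "Victor_Hugo", [WD + "Q535"])
print(external_equivalents(linked, MY + "Victor_Hugo"))
```

The first graph is usable on its own; the sameAs triples are purely additive, so they can be introduced without touching the existing triples.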
The same reasoning could be applied to properties. However, this seems less important as properties like dcterms:creator and rdfs:label are commonly used.
The criticism about dereferencing a property can be illustrated with dcterms:creator, whose link leads to a description in English that may be inappropriate in the context of the dataset being created. Similarly, and even more clearly, rdfs:label leads to a page containing a complete ontology, in the middle of which the 'label' property is defined, which is not very user-friendly. In this case, however, one can simply refrain from making this link active in the dereferencing page; on the other hand, using rdfs:label remains relevant, because many datasets use it and it is common to try to "understand" an entity URI by looking for its label through an rdfs:label property.
Another approach may be to define our own property as equal to or derived from the commonly used property. For example, we would have:
my:label
Various ways can be used to express the relationship between my:label and rdfs:label. For example:
my:label owl:sameAs rdfs:label
or
my:label rdfs:subPropertyOf rdfs:label # which enables us to express differences between the two
or
my:label owl:equivalentProperty rdfs:label
The disadvantage of these approaches is that they make writing queries on the dataset more complex and require those who want to use the dataset to understand how it is constructed.
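The extra query complexity can be illustrated with a Python sketch, assuming the rdfs:subPropertyOf variant and a graph modeled as a set of tuples. A consumer who only looks for rdfs:label finds nothing; to find the label, the consumer has to know that my:label exists and follow the sub-property declaration. The function names naive_labels, sub_properties and labels are hypothetical.

```python
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
MY = "http://exemple.org/kg/"

GRAPH = {
    (MY + "label", RDFS + "subPropertyOf", RDFS + "label"),
    (MY + "Victor_Hugo", MY + "label", "Victor Hugo"),
}


def naive_labels(graph, uri):
    """What a consumer unaware of my:label would do: match rdfs:label only."""
    return sorted(o for s, p, o in graph if s == uri and p == RDFS + "label")


def sub_properties(graph, prop):
    """Return prop plus every property declared, transitively, below it."""
    found = {prop}
    frontier = [prop]
    while frontier:
        current = frontier.pop()
        for s, p, o in graph:
            if p == RDFS + "subPropertyOf" and o == current and s not in found:
                found.add(s)
                frontier.append(s)
    return found


def labels(graph, uri):
    """Match rdfs:label and all of its declared sub-properties."""
    props = sub_properties(graph, RDFS + "label")
    return sorted(o for s, p, o in graph if s == uri and p in props)


print(naive_labels(GRAPH, MY + "Victor_Hugo"))
print(labels(GRAPH, MY + "Victor_Hugo"))
```

A triple store with RDFS inference enabled could do this expansion on the consumer's behalf, but without it the query author has to encode the knowledge of the dataset's structure explicitly.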
This topic could be developed at much greater length, but we will stop here. To conclude, we recommend starting with our own URIs for entities and for properties, except for the properties that we want to share with as many well-established datasets as possible, and then, in a second step, establishing links between our URIs and external datasets.
Note: to go further, I would recommend, for example, the article "An Analysis of Links in Wikidata" published in the proceedings of the ESWC 2022 conference.