Data&Musée

Exploring French cultural heritage data

Creating knowledge from WikidataJoconde with AMIE3: step 1, establishing rules

We saw in the post

https://datamusee.wp.imt.fr/2023/07/25/creation-dun-dump-wikidata-de-la-base-joconde/

how I created a Wikidata dump of the descriptions of artworks in the Joconde database, which I've named WikidataJoconde. This dump is available on Zenodo at:

https://zenodo.org/record/7941537

WikidataJoconde contains 484207 triples for 28524 entities using 370 properties. The entities are mainly of the painting type (Q3305213; 14491 occurrences). A total of 18099 Joconde artworks are described in this dataset.

I'm going to analyze this dump with AMIE3 to produce new knowledge (rules, in the AMIE sense) deduced from the dump. AMIE3 is a tool that analyzes a set of RDF triples and mines rules for inferring new triples from the presence of other triples.

A presentation of results obtained with AMIE3 is available in the following article:

Fast and Exact Rule Mining with AMIE3

WikidataJoconde is distributed in N-Triples (nt) format. To use it with AMIE3, you need to transform it into a TSV file. The general principle is simple: for a line containing a triple such as

<http://www.wikidata.org/entity/Q117600569>	<http://www.wikidata.org/prop/direct/P195>	<http://www.wikidata.org/entity/Q3044753> .

remove the final period and replace the whitespace between the elements of the triple with tabs.
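The conversion described above can be sketched in a few lines of Python. This is only a minimal sketch: the function names nt_line_to_tsv and convert are my own, the file paths are left to the caller, and the regular expression assumes one complete triple per line, as in the example above.

```python
import re

# One statement per line: subject, predicate, then everything up to the final " ."
# (the object may be a literal that itself contains spaces or periods).
TRIPLE = re.compile(r"^(\S+)\s+(\S+)\s+(.*?)\s*\.\s*$")

def nt_line_to_tsv(line):
    """Return the tab-separated form of one N-Triples line, or None to skip it."""
    line = line.strip()
    if not line or line.startswith("#"):  # blank line or comment
        return None
    match = TRIPLE.match(line)
    if match is None:  # not a complete triple: skipped in this sketch
        return None
    return "\t".join(match.groups())

def convert(nt_path, tsv_path):
    """Write a TSV copy of an N-Triples file (both paths are supplied by the caller)."""
    with open(nt_path, encoding="utf-8") as src, \
         open(tsv_path, "w", encoding="utf-8") as dst:
        for line in src:
            row = nt_line_to_tsv(line)
            if row is not None:
                dst.write(row + "\n")
```

Matching the object lazily up to the last period keeps literals such as "Mona Lisa"@en in a single column, which a plain replacement of spaces by tabs would not.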

Running AMIE3 on the resulting file with

java -jar amie_plus.jar "path to the tsv file" > path to the result file

gives 94 'rules', each accompanied by a set of values to help assess its relevance (which I won't comment on here). For example:

?b  <http://www.wikidata.org/prop/direct/P361>  ?a   => 
?a  <http://www.wikidata.org/prop/direct/P527>  ?b	
0,394495413	0,262195122	0,988505747	86	328	87	?b	0.0	0.0	0.0

P361 is the Wikidata 'part of' property and P527 the 'has part(s)' property.

We can see that AMIE3 has deduced that these two properties are inverses of each other. The rule could feed a set of SHACL rules or an ontology describing the dataset. It can be read as 'if ?b is a part of ?a, then ?a has part ?b'.
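Although I won't discuss the values in detail, a quick check helps to read the line of numbers that follows the rule. The sketch below assumes the columns follow AMIE's usual output order (head coverage, standard confidence, PCA confidence, positive examples, body size, PCA body size); the post itself does not label them, and the function name parse_metrics is mine. The decimal commas come from the line as quoted above.

```python
def parse_metrics(line):
    """Parse the numeric columns that follow a rule (assumed column order)."""
    fields = line.split("\t")

    def to_float(text):
        # The quoted line uses decimal commas.
        return float(text.replace(",", "."))

    return {
        "head_coverage": to_float(fields[0]),
        "std_confidence": to_float(fields[1]),
        "pca_confidence": to_float(fields[2]),
        "support": int(fields[3]),        # positive examples
        "body_size": int(fields[4]),
        "pca_body_size": int(fields[5]),
    }

metrics = parse_metrics(
    "0,394495413\t0,262195122\t0,988505747\t86\t328\t87\t?b\t0.0\t0.0\t0.0"
)

# Under the assumed column order, both confidences are simple ratios.
std_check = metrics["support"] / metrics["body_size"]      # 86 / 328
pca_check = metrics["support"] / metrics["pca_body_size"]  # 86 / 87
```

With this reading, 86 / 328 matches the second column and 86 / 87 matches the third, which is consistent with the assumed order: the rule holds for 86 of the 328 'part of' statements, and for 86 of the 87 cases where ?a has at least one 'has part(s)' statement.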

The 94 rules involve 35 properties.

Analysis of these rules reveals several kinds of rules. We'll comment on a few of them in the rest of this post.

Rules where one triple implies another

Some rules reveal that one property may be the inverse of another:

  • "derivative work" is the inverse of "based on" (*)
  • "has part(s)" is the inverse of "part of" (*)

Some rules suggest that one property may be a refinement (a more specific version) of another:

  • "part of the series" refines "part of" (*)
  • "published in" implies "catalog". This rule is false, but its converse is plausible: the fact that ?a is published in ?b does not imply that ?a is in the catalog ?b, because ?b is not necessarily a catalog, whereas the fact that ?a is in the catalog ?b does imply that ?a is published in ?b.
  • "depicts" implies "main subject". This rule is false, but its converse is plausible: the fact that ?a depicts ?b does not imply that ?a has ?b as its main subject, whereas the fact that ?a has ?b as its main subject does imply that ?a depicts ?b.

Some rules show that a property could be symmetric:

  • "pendant of" (*) indicating that one work is the counterpart of another in a set of artworks,
  • "different from"

Some rules show that if a property connects ?a to ?b, then ?a is different from ?b:

  • if ?a is based on ?b, then ?a is different from ?b
  • if ?a is a derivative work of ?b, then ?a is different from ?b

Some rules show that two properties could be equivalent:

  • 'country' and 'country of origin'
  • 'image' and 'image with frame'

In fact, in both cases, the second property should be seen as a refinement (restriction) of the first. Of the two rules that together suggest an equivalence, one is therefore exact while the other is not.
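To make the asymmetry concrete, here is a small sketch on invented data in which every 'country of origin' statement is mirrored by a 'country' statement, but not the other way round. The triples, the property labels used as plain strings and the function std_confidence are illustrative assumptions. It computes the standard confidence (the share of premise statements for which the conclusion also holds), not the PCA confidence reported by AMIE3.

```python
def std_confidence(triples, body_prop, head_prop):
    """Share of (subject, object) pairs of body_prop that also hold for head_prop."""
    body = {(s, o) for s, p, o in triples if p == body_prop}
    head = {(s, o) for s, p, o in triples if p == head_prop}
    return len(body & head) / len(body) if body else 0.0

# Invented toy data: w3 has a 'country' but no 'country of origin'.
toy = [
    ("w1", "country of origin", "France"),
    ("w1", "country", "France"),
    ("w2", "country of origin", "Italy"),
    ("w2", "country", "Italy"),
    ("w3", "country", "France"),
]

refinement_to_general = std_confidence(toy, "country of origin", "country")
general_to_refinement = std_confidence(toy, "country", "country of origin")
```

On this toy data the rule going from the refinement to the general property always holds, while the rule in the other direction fails for w3. That is the pattern one would expect if one property really is a restriction of the other.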

The rules marked with a star are relevant and could be expressed in an ontology.

Some rules refer to discussions in the Wikidata editing community.

For example, the rules relating 'country' and 'country of origin' suggest that the two are equivalent. In fact, 'country of origin' could be seen as a restriction of 'country'. But 'country' is loosely defined: it is sometimes used in the sense of 'country of origin', and sometimes refers to the country where the work is kept. Other interpretations of the 'country' property are possible: the country represented in the work? The country where the event represented in the work takes place?

One rule states that

if ?a follows ?b in a series ('followed by'), then ?a is a derivative work of ?b.

This rule is false; it may reveal a loose use of these properties in our dataset.

Two rules lead us to consider the properties 'location' (used 17589 times) and 'collection' (used 19912 times) as interchangeable. 'location' is described by Wikidata as 'location of the object, structure or event' and takes values such as 'Room 702', 'France', 'Notre-Dame de Paris' and 'Condé Museum'. 'collection' is described as 'collection of works of art, museum, or bibliography of which the subject is a part' and takes values such as 'Department of Paintings of the Louvre', 'Musée d'Orsay', 'Unterlinden Museum' and 'National Archaeological Museum'. Only 247 values are used with both properties, out of over 30,000 possible values. This shows that these properties are not interchangeable, even though they cover correlated concepts.

The rules are probably due to the fact that 8981 entities use both properties with the same value. This high number arises because the collection to which a work belongs is often the collection of the museum where the work is located: in that case, both properties can take the same value. However, for 'location', the recommendation of Wikidata editors is to give the most precise location possible: this is illustrated by the value above, 'Room 702', which is in fact a room in the Louvre museum.

When we look at the values associated with these rules, we find a support of 9000 (number of entities used to establish the rules), and a PCA confidence of 0.512 for one and 0.452 for the other.
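The two counts used in this comparison (values shared by both properties, and entities that carry the same value for both) can be computed along the following lines. This is a sketch under assumptions: the function name compare_properties and the toy triples are mine, and triples are taken to be (subject, property, value) tuples already loaded in memory. The same check applies to any other pair of properties.

```python
from collections import defaultdict

def compare_properties(triples, prop_a, prop_b):
    """Return (values used with both properties, entities sharing a value on both)."""
    values = {prop_a: defaultdict(set), prop_b: defaultdict(set)}
    for s, p, o in triples:
        if p in values:
            values[p][s].add(o)
    a, b = values[prop_a], values[prop_b]
    # Values that appear at least once with each property, on any entity.
    shared_values = set().union(*a.values()) & set().union(*b.values())
    # Entities for which the two properties have at least one value in common.
    same_value_entities = {s for s in a.keys() & b.keys() if a[s] & b[s]}
    return shared_values, same_value_entities

# Invented toy data, with property labels used as plain strings.
toy = [
    ("w1", "location", "Condé Museum"),
    ("w1", "collection", "Condé Museum"),
    ("w2", "location", "Room 702"),
    ("w2", "collection", "Department of Paintings of the Louvre"),
    ("w3", "collection", "Musée d'Orsay"),
]
shared, same = compare_properties(toy, "location", "collection")
```

A large set of same-value entities combined with a small set of shared values is the situation described above: enough overlap for AMIE3 to propose the rules, without the properties being interchangeable.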

Two rules lead us to consider the properties 'time period' (used 286 times) and 'movement' (used 1454 times) as interchangeable. 'time period' is described by Wikidata as 'period (historical epoch or era, sporting or theatrical season, legislative period etc.) in which the subject appears' and takes values such as 'Roman Empire', 'Ancient Rome', 'Hellenistic period' or 'late antiquity'. 'movement' is described as 'literary, artistic or philosophical movement associated with this person or work' and takes values such as 'academic art', 'Roman sculpture', 'neoclassicism' and 'Dutch Golden Age painting'.

Only 7 values are used with both properties: 'High Renaissance' (231 times), 'Romanticism' (416 times), 'Early Renaissance' (14 times), 'Baroque' (413 times), 'Dutch Golden Age painting' (76 times), 'mannerism' (18 times), 'Renaissance' (4 times). This shows that these properties are not interchangeable, even though they cover correlated concepts.

The rules are probably due to the fact that 19 entities use the two properties with the same value. When we look at the values associated with these rules, we find a support of 19 (number of entities used to establish the rules), but a relatively low PCA confidence: 0.283 for one and 0.268 for the other.

Rules where two triples imply a third

Several rules follow the pattern below, for various properties pr:

if ?a pr ?f and ?b pr ?f then ?a 'different from' ?b

This can be considered true only if, by construction, AMIE3 excludes the case where ?a is identical to ?b. I haven't checked this, so I consider these rules to be false. In any case, they don't really provide any information.
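The sketch below does not test how AMIE3 behaves. It only illustrates why the question matters, by enumerating the (?a, ?b) pairs that satisfy the two premises on invented data. The function name body_bindings and the toy triples are assumptions of mine.

```python
from collections import defaultdict
from itertools import product

def body_bindings(triples, prop, distinct=False):
    """Pairs (a, b) such that 'a prop f' and 'b prop f' hold for some f."""
    subjects_by_value = defaultdict(set)
    for s, p, o in triples:
        if p == prop:
            subjects_by_value[o].add(s)
    pairs = set()
    for subjects in subjects_by_value.values():
        for a, b in product(subjects, repeat=2):
            if distinct and a == b:
                continue  # drop the case where both variables name the same entity
            pairs.add((a, b))
    return pairs

# Invented toy data: two works in the same collection.
toy = [("w1", "collection", "m"), ("w2", "collection", "m")]
all_pairs = body_bindings(toy, "collection")
distinct_pairs = body_bindings(toy, "collection", distinct=True)
```

Without the distinctness constraint, every entity is paired with itself, and the rule would conclude that an entity is different from itself. With the constraint, only pairs of distinct entities remain, and the conclusion holds trivially, which is why these rules carry no real information either way.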

Some rules are based on the 'different from' property and draw erroneous conclusions. For example, one rule states:

if ?f is based on ?b and ?a different from ?f then ?a is based on ?b

This rule is obviously not semantically relevant.

Overall, rules that use 'different from' in the premises but not in the conclusion are wrong (14 cases). Those that use 'different from' in both a premise and the conclusion are correct, unless both premises use 'different from'. However, when there are two premises, the one that does not use 'different from' adds nothing to the rule: a similar rule, whose only premise is the one using 'different from', also exists and reaches the same conclusion.

Two rules are equivalent and describe something that is often true:

if ?a pendant of ?b (in a set of artworks) and ?a is made from material ?c, then ?b is (probably) made from material ?c

There are other similar pairs of rules whose premises combine 'pendant of' with another property, the latter propagating to the work linked by 'pendant of'. The properties concerned are: 'copyright status', 'country of origin', 'instance of', 'movement', 'commissioned by', 'country', 'catalog', 'collection', 'creator', 'owned by', 'location', 'genre', 'depicted format' (the accuracy of this one is uncertain). This makes 28 rules that can be considered accurate.
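Applying such rules amounts to proposing, for each 'pendant of' link, the statements of one work that are missing on the other. The sketch below shows the idea on invented data. The function name pendant_candidates, the property labels used as plain strings and the toy triples are assumptions of mine, and since the rules only hold "probably", the output should be read as candidates to review, not as facts.

```python
from collections import defaultdict

def pendant_candidates(triples, pendant, props):
    """Statements (b, q, value) suggested by 'a pendant b' and 'a q value'."""
    facts = set(triples)
    by_subject = defaultdict(list)
    for s, p, o in facts:
        by_subject[s].append((p, o))
    candidates = set()
    for a, p, b in facts:
        if p != pendant:
            continue
        for q, value in by_subject[a]:
            # Propose the statement only if it is not already in the dataset.
            if q in props and (b, q, value) not in facts:
                candidates.add((b, q, value))
    return candidates

# Invented toy data.
toy = [
    ("w1", "pendant of", "w2"),
    ("w1", "made from material", "oil paint"),
    ("w2", "made from material", "oil paint"),
    ("w3", "pendant of", "w4"),
    ("w3", "made from material", "canvas"),
]
suggested = pendant_candidates(toy, "pendant of", {"made from material"})
```

Counting the candidates produced on the real dump would give a first idea of how many works are concerned and whether the propagation would enrich the dataset.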

Within the scope of these rules, there are 225 entities linked by the 'pendant of' property. Across these pairs of entities, we find 48 properties that can take a different value for an entity and its 'sister'. For certain properties, such as 'label' or 'description', it's understandable that the values differ. A more detailed analysis of the properties that could be transferred from one entity to the other is beyond the scope of this post.

(note: see how many works would be involved and whether this would enrich the dataset)

A few rules that must be challenged

  • if ?a is different from ?b and ?b is different from ?a, then ?a is a derivative work of ?b

A similar set of rules:

  • if ?b is part of ?f and ?f has part ?a, a rule proposes that ?a is different from ?b
  • if ?a is part of ?f and ?b is part of ?f, a rule proposes that ?a is different from ?b

Overall results

With the default thresholds used by AMIE3, we obtain 94 rules, 46 of which our analysis considers correct. If we keep only rules with a PCA confidence greater than or equal to 0.6, 60 rules remain, of which 43 are considered correct.

We need to find a general method for eliminating false rules. In the meantime, a first pass is to eliminate rules with 'different from' or 'image' in the premises. This leaves 46 rules, 43 of which are considered correct. It should be noted that these same rules can be applied to all the works described in Wikidata, as they are generic in nature. This will be the subject of further work.
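This first filtering pass can be sketched as follows. The sketch assumes each rule is available as a pair (rule text, PCA confidence), with the rule text in the form shown earlier in this post (premises, then =>, then the conclusion). P1889 and P18 are the Wikidata identifiers for 'different from' and 'image'. The function names and the two sample rules with invented confidence values are illustrative only.

```python
WDT = "http://www.wikidata.org/prop/direct/"
DIFFERENT_FROM = f"<{WDT}P1889>"
IMAGE = f"<{WDT}P18>"

def premise_properties(rule):
    """Properties used before '=>', reading the premises as (subject, property, object)."""
    body, _, _ = rule.partition("=>")
    tokens = body.split()
    return set(tokens[1::3])  # every second token of each group of three

def filter_rules(rows, banned=(DIFFERENT_FROM, IMAGE), min_pca=0.0):
    """Keep rules above min_pca whose premises use no banned property."""
    banned = set(banned)
    return [(rule, pca) for rule, pca in rows
            if pca >= min_pca and not premise_properties(rule) & banned]

rows = [
    # Rule quoted in this post, with its PCA confidence.
    (f"?b  <{WDT}P361>  ?a   => ?a  <{WDT}P527>  ?b", 0.988505747),
    # Illustrative rules; the confidence values are invented.
    (f"?f  <{WDT}P144>  ?b  ?a  <{WDT}P1889>  ?f   => ?a  <{WDT}P144>  ?b", 0.7),
    (f"?a  <{WDT}P144>  ?b   => ?a  <{WDT}P1889>  ?b", 0.9),
]
kept = filter_rules(rows)
kept_strict = filter_rules(rows, min_pca=0.95)
```

Comparing whole tokens, not substrings, avoids confusing P18 with properties such as P186. A rule that uses 'different from' only in its conclusion is kept, in line with the criterion above.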

A future post will look at more generic rule filtering methods.

(the working spreadsheet for this post is available here: https://docs.google.com/spreadsheets/d/1-yvArUrA4XIL5HqwzOUOzyJy9fHkNrbNVtTebJQ32WQ/edit?usp=sharing)

Author: Moissinac

Associate professor (maître de conférences) at Télécom Paris, Image, Data, Signal Department, Multimedia Group. Jean-Claude Moissinac has conducted research on advanced techniques for the production, transport, representation and use of multimedia documents. This work first evolved toward the semantic representation of multimedia-related data (media processing workflows, description of media adaptations, formal description of user interactions). Today, his work focuses on building knowledge graphs. Main current research areas: semantic knowledge representation, knowledge graph construction, machine learning techniques on these graphs.
