
AMIE: Association Rule Mining under Incomplete Evidence in Ontological Knowledge Bases


Runtime information

AMIE can extract closed Horn rules from medium-sized ontologies in a few minutes.

Dataset               | # of facts | Threshold           | Latest runtime | Rules
YAGO2                 | 948,048    | Head coverage 0.01  | 3.62 min       | Sorted by: Std. Confidence, PCA Confidence
YAGO2                 | 948,048    | Support 2 facts     | 4.56 min       | All rules
YAGO2 sample          | 46,654     | Support 2 facts     | 5.41 s         | Sorted by PCA confidence
YAGO2 with constants  | 948,048    | Head coverage 0.01  | 17.76 min      | Some interesting examples
DBpedia 2.0           | 6,704,524  | Head coverage 0.01  | 2.89 min       | Rules up to 2 atoms

Knowledge bases

YAGO2

YAGO is a semantic knowledge base derived from Wikipedia, WordNet and GeoNames. The latest version, YAGO2s, contains 120M facts describing properties of 10M different entities. Since the rules output by AMIE are used for prediction, we used the previous version, YAGO2 (released in 2010), to predict facts in YAGO2s. YAGO2 contains 120M facts about 2.6M entities. For both versions of the ontology, we considered neither facts with literal objects nor any type of schema information (rdf:type statements, relation signatures and descriptions). For YAGO2s, this is equivalent to using the file yagoFacts, which contains around 4M triples. For YAGO2, we used the file yagocore, which contains 948K facts after cleaning. The clean testing versions of [YAGO2] and [YAGO2s] are available for download.
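As a rough illustration of this cleaning step, the following sketch filters a tab-separated triple file, dropping schema statements and facts with literal objects. It is not the actual AMIE tooling; the file names, the TSV layout and the convention that literals start with a quote are assumptions.

```python
# Minimal cleaning sketch (illustrative, not the script used for the experiments).
# Assumes one tab-separated triple (subject, predicate, object) per line.

def clean_triples(in_path, out_path):
    schema_relations = {"rdf:type", "rdfs:domain", "rdfs:range", "rdfs:subClassOf"}
    with open(in_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 3:
                continue                     # skip malformed lines
            subj, pred, obj = parts
            if pred in schema_relations:
                continue                     # drop schema information
            if obj.startswith('"'):
                continue                     # drop facts with literal objects
            dst.write(line)

# Hypothetical file names, shown only as a usage example.
clean_triples("yagoFacts.tsv", "yagoFacts.clean.tsv")
```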

YAGO2 sample

Our experiments included comparisons against state-of-the-art systems that could not handle even our clean version of YAGO2. For this reason, we built a sample of this KB by randomly picking 10K entities and collecting their 3-hop subgraphs. In contrast to a random sample of facts, this method preserves the original graph topology. This procedure resulted in a [47K-fact sample].
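The sketch below shows one way such a sample can be drawn: pick seed entities at random and keep every fact reachable within 3 hops of a seed. The in-memory triple representation and parameter names are assumptions, not the exact procedure used to build the published sample.

```python
# Illustrative 3-hop subgraph sampling over a set of (subject, predicate, object) triples.
import random
from collections import defaultdict, deque

def sample_subgraph(triples, num_seeds=10_000, hops=3, seed=42):
    # Index every fact by the entities it touches.
    adj = defaultdict(list)
    for s, p, o in triples:
        adj[s].append((s, p, o))
        adj[o].append((s, p, o))

    rng = random.Random(seed)
    entities = list(adj)
    seeds = rng.sample(entities, min(num_seeds, len(entities)))

    kept, visited = set(), set(seeds)
    frontier = deque((e, 0) for e in seeds)
    while frontier:
        entity, depth = frontier.popleft()
        if depth == hops:
            continue
        for s, p, o in adj[entity]:
            kept.add((s, p, o))              # keep every fact seen within the hop limit
            for nxt in (s, o):
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, depth + 1))
    return kept
```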

DBpedia

DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia. The English version of DBpedia contains 1.89 billion facts about 2.45M entities. In the spirit of our data prediction endeavours, we mined rules from DBpedia 2.0 to predict facts in the latest version, 3.8 (in English). In both cases we used the person data and infoboxes datasets and removed facts with literal objects and rdf:type statements. This produced a clean subset of 6M facts for [DBpedia 2.0] and 12M facts for [DBpedia 3.8].

Data prediction

Experimental setup

In order to support the suitability of the PCA Confidence metric for the prediction of new facts, we carried out an experiment that uses the rules mined on YAGO2 (training KB) to predict facts in the newer YAGO2s (target KB). We took all rules mined by AMIE with a head coverage threshold of 0.01 and ranked them by standard and PCA confidence. Then, for every rule, we generated new facts by taking all bindings of the head variables that occur in the body of the rule but not in the head (sets B, C and D in our mining model). For instance, for the rule ?s <directed> ?o => ?s <created> ?o, we produce predictions of the form A <created> B, where A and B are bindings of people and films in the <directed> relation (the body of the rule) that do not appear in the <created> relation. These are exactly the bindings that lie beyond the training KB YAGO2.
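For a rule with a single body atom, this prediction step amounts to a set difference between the body and head bindings, as in the toy sketch below. The in-memory KB and the example facts are purely illustrative.

```python
# Toy sketch: predictions of a rule ?s <body> ?o => ?s <head> ?o are the (s, o)
# pairs present in the body relation but absent from the head relation.

def rule_predictions(kb, body_relation, head_relation):
    body_pairs = {(s, o) for s, p, o in kb if p == body_relation}
    head_pairs = {(s, o) for s, p, o in kb if p == head_relation}
    return {(s, head_relation, o) for s, o in body_pairs - head_pairs}

# Illustrative mini-KB.
kb = {
    ("Spielberg", "<directed>", "Jaws"),
    ("Spielberg", "<created>", "Jaws"),
    ("Nolan", "<directed>", "Memento"),
}
print(rule_predictions(kb, "<directed>", "<created>"))
# {('Nolan', '<created>', 'Memento')}
```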

A fact can be predicted by more than one rule if the rules share the same head relation. For this reason, we went down the ranking and, for every rule, removed all predictions already produced by higher-ranked rules. From the remaining predictions, we took a sample of 30 facts and evaluated them either automatically against YAGO2s or manually by checking the information in Wikipedia. Automatic evaluation was used if (a) the prediction is in YAGO2s or (b) it violates a functionality constraint (e.g., predicting a second death place for a person) in any of the datasets.
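The deduplication down the ranking can be sketched as follows, reusing the single-atom prediction helper from the previous sketch. Representing a rule as a (body relation, head relation) pair and the sampling parameters are assumptions made only for illustration.

```python
# Sketch: walk down the confidence ranking, keep only predictions not already
# produced by higher-ranked rules, and draw a sample of 30 per rule for evaluation.
import random

def rule_predictions(kb, body_relation, head_relation):
    # Same idea as above: body bindings missing from the head relation.
    body = {(s, o) for s, p, o in kb if p == body_relation}
    head = {(s, o) for s, p, o in kb if p == head_relation}
    return {(s, head_relation, o) for s, o in body - head}

def evaluation_samples(ranked_rules, kb, sample_size=30, seed=0):
    # ranked_rules: (body_relation, head_relation) pairs sorted by confidence (illustrative).
    rng = random.Random(seed)
    seen, samples = set(), {}
    for body_rel, head_rel in ranked_rules:
        preds = rule_predictions(kb, body_rel, head_rel) - seen   # drop facts from higher-ranked rules
        seen |= preds
        if preds:
            samples[(body_rel, head_rel)] = rng.sample(sorted(preds), min(sample_size, len(preds)))
    return samples
```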

Experiments

Downloads