
LEILA: Learning to Extract Information by Linguistic Analysis


What this is

This is a set of corpora for relation extraction. Given a semantic target relation and a natural language corpus, relation extraction is the task of extracting all pairs of entities in the corpus that stand in the target relation. For example, if the target relation is instanceOf and the corpus contains the following passage

"President Mickey M. Mouse was happy to visit the city of Washington D.C., which is the capital of the United States."

then the goal is to extract the following pairs:

instanceOf
Mickey M. Mouse    president
Washington D.C.    city
Washington D.C.    capital

This website provides corpora for evaluating relation extraction systems. For each document in a corpus, we provide a list of manually extracted ideal pairs that the system should extract. Note that these pairs are linked only to the document, not to the sentence they were taken from. The corpora were used with LEILA.
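As an illustration of how document-level ideal pairs can be used, here is a minimal sketch that compares a system's output for one document with the ideal pairs. Precision and recall are a common choice of measure, not one prescribed by this page. The function names and the case-insensitive matching are assumptions made for the example, and the pairs are held in memory because the on-disk format of the annotation files is not described here.

```python
from typing import Iterable, Tuple

Pair = Tuple[str, str]


def normalize(pair: Pair) -> Pair:
    """Strip and lower-case both sides (an assumed matching policy)."""
    left, right = pair
    return (left.strip().lower(), right.strip().lower())


def score_document(extracted: Iterable[Pair], ideal: Iterable[Pair]) -> dict:
    """Compare extracted pairs with the ideal pairs of a single document.

    Pairs are matched per document, not per sentence, because the ideal
    pairs are only linked to the document.
    """
    extracted_set = {normalize(p) for p in extracted}
    ideal_set = {normalize(p) for p in ideal}
    correct = extracted_set & ideal_set
    precision = len(correct) / len(extracted_set) if extracted_set else 0.0
    recall = len(correct) / len(ideal_set) if ideal_set else 0.0
    return {"correct": len(correct), "precision": precision, "recall": recall}


# Ideal pairs from the instanceOf example above.
ideal = [
    ("Mickey M. Mouse", "president"),
    ("Washington D.C.", "city"),
    ("Washington D.C.", "capital"),
]
# Hypothetical system output: two correct pairs and one wrong pair.
extracted = [
    ("Mickey M. Mouse", "president"),
    ("Washington D.C.", "city"),
    ("United States", "city"),
]

result = score_document(extracted, ideal)
print(result)
```

Here two of the three extracted pairs are correct and two of the three ideal pairs are found, so precision and recall are both 2/3.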

 

What types of files we have

html             the original document
lgi              the proper sentences of the original document (Link Grammar Input), as extracted by HTML2LGI.java
ll               the non-grammatical parts of the sentences of the original document, as extracted by HTML2LGI.java
lgo              the parsed version of the proper sentences (Link Grammar Output), as produced by LGParse.java by calling the Link Grammar Parser
inst/birt/syn    the manually extracted ideal pairs, as produced by human annotators with HandTag.java. inst files contain instanceOf pairs, birt files contain person/birthdate pairs, and syn files contain synonymy pairs.
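The file types above can be collected per document with a few lines of code. The following is only a sketch: it assumes that all files belonging to one document share a base name and differ only in their extension, which this page does not state explicitly, and the file names used are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePath

# Extensions described above, with a short description of each.
FILE_TYPES = {
    "html": "original document",
    "lgi": "proper sentences (Link Grammar Input)",
    "ll": "non-grammatical parts of the sentences",
    "lgo": "parsed proper sentences (Link Grammar Output)",
    "inst": "manually extracted instanceOf pairs",
    "birt": "manually extracted person/birthdate pairs",
    "syn": "manually extracted synonymy pairs",
}


def group_by_document(filenames):
    """Map each base name to a dict of {extension: file name}.

    Files with an extension that is not listed in FILE_TYPES are ignored.
    """
    groups = defaultdict(dict)
    for name in filenames:
        path = PurePath(name)
        ext = path.suffix.lstrip(".").lower()
        if ext in FILE_TYPES:
            groups[path.stem][ext] = name
    return dict(groups)


# Hypothetical file names, for illustration only.
files = ["bach.html", "bach.lgi", "bach.ll", "bach.lgo", "bach.inst", "notes.txt"]
groups = group_by_document(files)
for ext, name in groups["bach"].items():
    print(f"{name}: {FILE_TYPES[ext]}")
```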

The manually extracted pairs are licensed under the Creative Commons Attribution License by the author Fabian M. Suchanek. If you use them, please cite our paper. The other files are licensed under the GNU Free Documentation License, unless their authors have placed them under different terms.

Corpora

Corpus             # Docs    Relation                        # annotated    Remarks
Googlecomposers    492       instanceOf                      100            We used Google to search for the baroque, classical and romantic composers of Wikipedia. We downloaded the first page in the result list (using the "I'm feeling lucky" button) excluding Wikipedia pages. This corpus is highly incoherent, containing advertisements as well as pages with no proper sentences at all.
Wikicomposers      872       instanceOf, person/birthdate    87             All Wikipedia articles about composers.
Wikigeography      313       synonymy                        130            All Wikipedia articles about the geography of countries.
Wikigeneral (-)    223       instanceOf                      223            Some random Wikipedia articles.