This is a set of corpora for relation extraction. Given a semantic target relation and a natural language corpus, relation extraction is the task of extracting all pairs of entities in the corpus that stand in the target relation. For example, if the target relation is instanceOf and the corpus contains the following passage
then the goal is to extract the following pairs:
Mickey M. Mouse | president
Washington D.C. | city
Washington D.C. | capital
This web site provides corpora for evaluating Relation Extraction systems. For each document in the corpus, we provide a list of manually extracted ideal pairs that should be extracted by the system. Note that these pairs are not linked to the original sentence, but only to the document. The corpora were used with LEILA.
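Because the ideal pairs are attached to documents rather than sentences, a system's output can only be compared with them one document at a time. The following is a minimal sketch of such a comparison, not the procedure used with LEILA: the function names `normalize` and `score_document` are hypothetical, the case- and whitespace-insensitive matching is an assumption, and the sample pairs are taken from the example above plus one invented wrong pair.

```python
from typing import Iterable, Tuple

Pair = Tuple[str, str]


def normalize(pair: Pair) -> Pair:
    """Collapse whitespace and lowercase both entities (an assumed matching rule)."""
    first, second = pair
    return (" ".join(first.split()).lower(), " ".join(second.split()).lower())


def score_document(extracted: Iterable[Pair], ideal: Iterable[Pair]) -> Tuple[float, float]:
    """Return (precision, recall) of the extracted pairs for one document."""
    extracted_set = {normalize(p) for p in extracted}
    ideal_set = {normalize(p) for p in ideal}
    correct = extracted_set & ideal_set
    precision = len(correct) / len(extracted_set) if extracted_set else 0.0
    recall = len(correct) / len(ideal_set) if ideal_set else 0.0
    return precision, recall


# Ideal pairs for one document, as in the instanceOf example above.
ideal = [
    ("Mickey M. Mouse", "president"),
    ("Washington D.C.", "city"),
    ("Washington D.C.", "capital"),
]

# Hypothetical system output: one correct pair and one wrong pair.
extracted = [
    ("Washington D.C.", "city"),
    ("Mickey M. Mouse", "mouse"),
]

precision, recall = score_document(extracted, ideal)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Comparing sets rather than lists means that a pair extracted from several sentences of the same document counts only once, which matches the document-level annotation.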
html | the original document
lgi | the proper sentences of the original document (Link Grammar Input), as extracted by HTML2LGI.java
ll | the non-grammatical parts of the sentences of the original document, as extracted by HTML2LGI.java
lgo | the parsed version of the proper sentences (Link Grammar Output), as produced by LGParse.java by calling the Link Grammar Parser
inst/birt/syn | the manually extracted ideal pairs, as produced by human annotators with HandTag.java. inst-files contain instanceOf pairs, birt-files contain person/birthdate pairs and syn-files contain synonymy pairs.
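When working with a corpus, it can help to collect all files that belong to the same document. The sketch below assumes that the file types listed above are file name extensions and that the files of one document share a base name; neither is stated here, so check this against the downloaded corpus. The function name `group_by_document` and the file names are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePath
from typing import Dict, Iterable

# File types described above, assumed here to be file name extensions.
KNOWN_EXTENSIONS = {"html", "lgi", "ll", "lgo", "inst", "birt", "syn"}


def group_by_document(filenames: Iterable[str]) -> Dict[str, Dict[str, str]]:
    """Map each base name to a dict from extension to file name."""
    grouped: Dict[str, Dict[str, str]] = defaultdict(dict)
    for name in filenames:
        path = PurePath(name)
        extension = path.suffix.lstrip(".").lower()
        if extension in KNOWN_EXTENSIONS:
            grouped[path.stem][extension] = name
    return dict(grouped)


# Hypothetical file names, for illustration only.
files = ["bach.html", "bach.lgi", "bach.lgo", "bach.inst", "notes.txt"]
grouped = group_by_document(files)
print(grouped)
```

Files with an unknown extension, such as `notes.txt` here, are skipped rather than reported.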
The manually extracted pairs are licensed under the Creative Commons Attribution License by the author, Fabian M. Suchanek. If you use them, please cite our paper. The other files are licensed under the GNU Free Documentation License, unless their authors have placed them under different terms.
Corpus | # Docs | Relation | # Annotated | Remarks
---|---|---|---|---
Googlecomposers | 492 | instanceOf | 100 | We used Google to search for the baroque, classical and romantic composers of Wikipedia. We downloaded the first page in the result list (using the "I'm feeling lucky" button), excluding Wikipedia pages. This corpus is highly incoherent, containing advertisements as well as pages with no proper sentences at all.
Wikicomposers | 872 | instanceOf, person/birthdate | 87 | All Wikipedia articles about composers.
Wikigeography | 313 | synonymy | 130 | All Wikipedia articles about the geography of countries.
Wikigeneral (-) | 223 | instanceOf | 223 | Some random Wikipedia articles.