LEILA: Learning to Extract Information by Linguistic Analysis

What this is

This is a set of corpora for relation extraction. Given a semantic target relation and a natural language corpus, relation extraction is the task of extracting all pairs of entities in the corpus that stand in the target relation. For example, if the target relation is instanceOf and the corpus contains the following passage

"President Mickey M. Mouse was happy to visit the city of Washington D.C., which is the capital of the United States."

then the goal is to extract the following pairs:

`instanceOf`

	Mickey M. Mouse	president
	Washington D.C.	city
	Washington D.C.	capital

This web site provides corpora for evaluating relation extraction systems. For each document in a corpus, we provide a list of manually extracted ideal pairs that the system should extract. Note that these pairs are linked only to the document, not to the sentence in which they occur. The corpora were used with LEILA.
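
Because the ideal pairs are attached to documents rather than sentences, a system can be scored per document by comparing sets of pairs. The following is a minimal sketch of such a comparison, not the evaluation procedure used with LEILA: the function names `normalize` and `score_document`, the case-insensitive matching, and the sample extracted pairs are all assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]


def normalize(pair: Pair) -> Pair:
    """Strip and lower-case both arguments (an assumed, simple matching rule)."""
    first, second = pair
    return (first.strip().lower(), second.strip().lower())


def score_document(extracted: Iterable[Pair], ideal: Iterable[Pair]) -> Tuple[float, float]:
    """Return (precision, recall) of the extracted pairs for one document."""
    extracted_set: Set[Pair] = {normalize(p) for p in extracted}
    ideal_set: Set[Pair] = {normalize(p) for p in ideal}
    correct = extracted_set & ideal_set
    # Guard against empty sets so the division is always defined.
    precision = len(correct) / len(extracted_set) if extracted_set else 0.0
    recall = len(correct) / len(ideal_set) if ideal_set else 0.0
    return precision, recall


# Ideal instanceOf pairs for the example passage above.
ideal = [
    ("Mickey M. Mouse", "president"),
    ("Washington D.C.", "city"),
    ("Washington D.C.", "capital"),
]
# Hypothetical system output: two correct pairs and one wrong pair.
extracted = [
    ("Mickey M. Mouse", "president"),
    ("Washington D.C.", "city"),
    ("United States", "city"),
]

precision, recall = score_document(extracted, ideal)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact string matching after normalization is the simplest choice; a real evaluation may need a more lenient comparison of entity names.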

What types of files we have

	`html`	the original document
	`lgi`	the proper sentences of the original document (Link Grammar Input), as extracted by HTML2LGI.java
	`ll`	the non-grammatical parts of the sentences of the original document, as extracted by HTML2LGI.java
	`lgo`	the parsed version of the proper sentences (Link Grammar Output), as produced by LGParse.java by calling the Link Grammar Parser
	`inst/birt/syn`	the manually extracted ideal pairs, as produced by human annotators with HandTag.java. `inst`-files contain instanceOf-pairs, `birt`-files contain person/birthdate-pairs and `syn`-files contain synonymy pairs.
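
As a rough sketch of how these file types might be collected per document, the snippet below groups file names by extension. It assumes that all files derived from one document share a base name and differ only in their extension, which the description above does not state. The file names and the helper `group_by_document` are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePath
from typing import Dict, Iterable

# Extensions listed above, with a short description of each.
FILE_KINDS = {
    "html": "original document",
    "lgi": "proper sentences (Link Grammar Input)",
    "ll": "non-grammatical parts of the sentences",
    "lgo": "parsed sentences (Link Grammar Output)",
    "inst": "ideal instanceOf pairs",
    "birt": "ideal person/birthdate pairs",
    "syn": "ideal synonymy pairs",
}


def group_by_document(names: Iterable[str]) -> Dict[str, Dict[str, str]]:
    """Map each base name to its files, keyed by extension.

    Assumes (hypothetically) that one document's files share a base name.
    Files with an extension not listed in FILE_KINDS are ignored.
    """
    grouped: Dict[str, Dict[str, str]] = defaultdict(dict)
    for name in names:
        path = PurePath(name)
        extension = path.suffix.lstrip(".").lower()
        if extension in FILE_KINDS:
            grouped[path.stem][extension] = name
    return dict(grouped)


# Hypothetical file names for illustration only.
files = ["bach.html", "bach.lgi", "bach.ll", "bach.lgo", "bach.inst", "bach.birt", "notes.txt"]
docs = group_by_document(files)

for base, parts in docs.items():
    for extension, name in sorted(parts.items()):
        print(f"{base}: {name} ({FILE_KINDS[extension]})")
```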

The manually extracted pairs are licensed under the Creative Commons Attribution License by the author Fabian M. Suchanek. If you use them, please cite our paper. The other files are licensed under the GNU Free Documentation License, unless their authors have placed them under different terms.

Corpora

Corpus	# Docs	Relation	# annotated	Remarks
Googlecomposers	492	instanceOf	100	We used Google to search for the baroque, classical and romantic composers of Wikipedia. For each composer, we downloaded the first page in the result list (using the "I'm feeling lucky" button), excluding Wikipedia pages. This corpus is highly incoherent, containing advertisements as well as pages with no proper sentences at all.
Wikicomposers	872	instanceOf, person/birthdate	87	All Wikipedia articles about composers
Wikigeography	313	synonymy	130	All Wikipedia articles about the geography of countries
Wikigeneral (-)	223	instanceOf	223	Some random Wikipedia articles.