LEILA: Learning to Extract Information by Linguistic Analysis


Downloads

LEILA source code (Java) and documentation
This code is licensed under the Creative Commons Attribution License by the author Fabian M. Suchanek. If you use the code, please cite our paper.
Browse the documentation
See our corpora.

How to use LEILA

Download the Java tools
Download the Link Grammar Parser
Install a recent version of Java (1.5 or later) if you don't have one
Download the Java-source, the class-files and the documentation of LEILA here
Run Leila.class. LEILA will tell you how to set it up.

How data flows in LEILA

The flow of data with LEILA is as follows:

    Corpus      ->   Proper sentences  ->  Parsed sentences  ->  Model   ->  Output Pairs
    documents        (*.LGI)               (*.LGO)               (*.MDL)     (*.TXT)
    (*.HTML)                                     '------------------------->

       '---HTML2LGI.java--'  '----LGParse.java---' '--Train.java---'  '---Test.java---'

       '--------------------------Leila.java------------------------------------------'

The corpus can be any set of text or HTML documents, which may be spread across different folders or subfolders. The class HTML2LGI.java extracts the proper sentences from the corpus documents. Each document generates one LGI-file containing its sentences. These LGI-files are given to the Link Grammar Parser (called by LGParse.java), which produces parse trees for the sentences. Each LGI-file generates one LGO-file containing the parse trees. The class Train.java tries to find patterns for the target relation in the LGO-files. It generalizes these patterns and stores them as a model in an MDL-file. The class Test.java applies the model to extract output pairs for the target relation from the LGO-files. It stores them in one large plain text file. Leila.java runs all of these steps automatically in the right order.
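The dependencies between the steps can be written down as a small table. The following is only a sketch for illustration and is not part of LEILA: the class LeilaPipelineSketch, the enum Step and the method isValidOrder are hypothetical names, and the file types are taken from the diagram above. It checks that each step only reads file types that the corpus or an earlier step has already provided.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not LEILA code: models which file types each step reads and writes.
class LeilaPipelineSketch {

    enum Step {
        HTML2LGI("LGI", "HTML"),      // corpus documents -> proper sentences
        LGPARSE("LGO", "LGI"),        // proper sentences -> parsed sentences
        TRAIN("MDL", "LGO"),          // parsed sentences -> model
        TEST("TXT", "LGO", "MDL");    // parsed sentences + model -> output pairs

        final String output;
        final List<String> inputs;

        Step(String output, String... inputs) {
            this.output = output;
            this.inputs = Arrays.asList(inputs);
        }
    }

    /** Returns true if every step reads only file types that already exist when it runs. */
    static boolean isValidOrder(Step[] order) {
        Set<String> available = new HashSet<String>();
        available.add("HTML"); // the corpus is given up front
        for (Step step : order) {
            if (!available.containsAll(step.inputs)) {
                return false;
            }
            available.add(step.output);
        }
        return true;
    }

    public static void main(String[] args) {
        for (Step step : Step.values()) {
            System.out.println(step + ": " + step.inputs + " -> " + step.output);
        }
        System.out.println("Order is valid: " + isValidOrder(Step.values()));
    }
}
```

Note that the test step needs both the LGO-files and the MDL-file, which is why training has to finish before testing can start.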

Train.java must know the target relation. The target relation is given by a function that decides whether a pair of words is an example, a counterexample or a candidate for the relation. This function should be implemented in a class that extends Relation.java. To LEILA, it does not matter how the function works internally. The most common way is to load a list of example pairs from a text file. To decide whether a pair of words is an example pair, the function can then simply check whether the pair is in the list. Often, the counterexamples need not be listed explicitly, because they can be deduced algorithmically on the fly. See the experimental section of "LEILA: Learning to Extract Information by Linguistic Analysis" (pdf, ppt, bib) for examples.
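As a rough illustration of such a function, here is a minimal, self-contained sketch. It deliberately does not extend Relation.java, because the actual methods of that class are not described here; the class ExampleListRelation, the enum Verdict and the methods addExample and classify are hypothetical names. The sketch assumes that the relation is a function (each first word has exactly one correct second word), so a pair with a known first word and a different second word can be deduced to be a counterexample without listing it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a relation; LEILA's real Relation.java may look quite different.
class ExampleListRelation {

    /** The three possible verdicts for a pair of words. */
    enum Verdict { EXAMPLE, COUNTEREXAMPLE, CANDIDATE }

    // Known example pairs (first word -> second word), e.g. loaded from a text file.
    private final Map<String, String> examples = new HashMap<String, String>();

    void addExample(String first, String second) {
        examples.put(first, second);
    }

    Verdict classify(String first, String second) {
        String known = examples.get(first);
        if (known == null) {
            return Verdict.CANDIDATE;      // nothing is known about the first word
        }
        if (known.equals(second)) {
            return Verdict.EXAMPLE;        // the pair is in the list
        }
        // Assumption: the relation is a function, so any other second word is wrong.
        return Verdict.COUNTEREXAMPLE;
    }

    public static void main(String[] args) {
        ExampleListRelation birthYear = new ExampleListRelation();
        birthYear.addExample("Einstein", "1879"); // illustrative pair, not from LEILA's data
        System.out.println(birthYear.classify("Einstein", "1879")); // prints EXAMPLE
        System.out.println(birthYear.classify("Einstein", "1905")); // prints COUNTEREXAMPLE
        System.out.println(birthYear.classify("Curie", "1867"));    // prints CANDIDATE
    }
}
```

For relations that are not functions, such as synonymy, this deduction of counterexamples would not be valid, and a different rule would be needed.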

Existing relations in LEILA

The following relations ship with LEILA:

InstanceOf.java (extends Relation.java) is the relation between an entity and its concept. The example pairs come from WordNet and are included in the distribution.
Synonymy.java (extends Relation.java) is the relation between synonymous words. The example pairs come from WordNet and are included in the distribution.
Headquarters.java (extends Relation.java) is the relation between a company and the city of its headquarters. The example pairs are not included in the distribution, because they depend on the corpus, which is copyright restricted.
Birthdates.java (extends Relation.java) is the relation between a person and her birth date. The example pairs are not included in the distribution due to copyright restrictions.
SimpleFunction.java (extends Relation.java) is a many-to-one relation for demonstration purposes. The example pairs are included in the distribution.
StupidRelation.java (extends Relation.java) is a relation of just one pair for debugging purposes. The example pair is hard-coded in the source.