Leila.class. LEILA will tell you how to set it up.
The flow of data with LEILA is as follows:
Corpus     ->  Proper sentences  ->  Parsed sentences  ->  Model  ->  Output Pairs
documents      (*.LGI)               (*.LGO)               (*.MDL)    (*.TXT)
(*.HTML)                                '------------------------->
'---HTML2LGI.java--' '----LGParse.java---' '--Train.java---' '---Test.java---'
'--------------------------Leila.java------------------------------------------'
The corpus can be any set of text or HTML documents. These documents can be spread across
different folders or subfolders. The class HTML2LGI.java extracts the proper sentences from the corpus
documents. Each document yields one LGI file containing its sentences.
These LGI files are given to the Link Grammar Parser (called by LGParse.java),
which produces parse trees for the sentences. Each LGI file yields one LGO file containing the parse trees.
The class Train.java tries to find patterns for the target relation in the LGO files. It generalizes these
patterns and stores them as a model in an MDL file. The class Test.java
applies the model to the LGO files to extract output pairs for the target relation. It stores
them in one large plain text file. Leila.java runs all of these steps automatically in the right order.
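The file layout described above can be sketched as follows. This is only an illustration under stated assumptions, not code from the LEILA distribution: the class CorpusLayoutSketch and its methods are hypothetical, documents are assumed to be recognizable by a .txt, .html or .htm extension, and each derived file is assumed to sit next to its source document.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of the LEILA distribution.
final class CorpusLayoutSketch {

    // Recursively collects the text and HTML documents below corpusRoot,
    // so that documents in subfolders are found as well.
    static List<Path> findDocuments(Path corpusRoot) throws IOException {
        try (Stream<Path> paths = Files.walk(corpusRoot)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(CorpusLayoutSketch::looksLikeDocument)
                .sorted()
                .collect(Collectors.toList());
        }
    }

    // Assumption: corpus documents are recognized by their file extension.
    static boolean looksLikeDocument(Path file) {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".txt") || name.endsWith(".html") || name.endsWith(".htm");
    }

    // Swaps the extension, e.g. page.html -> page.LGI.
    // Assumption: the derived file is placed next to its source file.
    static Path withExtension(Path file, String extension) {
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String base = dot < 0 ? name : name.substring(0, dot);
        return file.resolveSibling(base + "." + extension);
    }
}
```

With these helpers, one document maps to one LGI file and one LGO file, for example withExtension(doc, "LGI") and withExtension(doc, "LGO"), while the model and the output pairs each live in a single file for the whole corpus.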
Train.java must know the target relation. The target relation is given by a function
that decides whether a pair of words is an example, a counterexample or a candidate for the relation.
This function should be implemented in a class that extends Relation.java.
To LEILA, it does not matter how the function actually works internally. The most common way is to load a list of
example pairs from a text file. To decide whether a pair of words is an example pair, the function can just check
whether the pair is in the list. Often, the counterexamples need not be present in a list, but they can be
deduced algorithmically on the fly. See the experimental section of
"LEILA: Learning to Extract Information by Linguistic Analysis"
for examples.
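A list-backed decision function of this kind might look like the sketch below. It is deliberately a standalone class and does not extend Relation.java, because the method signatures of Relation.java are not shown here. The class name, the Verdict enum, the tab-separated line format and the rule for counterexamples are all assumptions made for illustration. The counterexample rule treats the relation as many-to-one: if the first word is already known with a different second word, the pair is taken to be a counterexample.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; it does not use the real Relation.java API.
final class ListBackedFunctionSketch {

    // The three possible decisions named in the text.
    enum Verdict { EXAMPLE, COUNTEREXAMPLE, CANDIDATE }

    // Maps the first word of each example pair to its second word.
    private final Map<String, String> examples = new HashMap<>();

    // Assumed format: one example pair per line, separated by a tab.
    // Lines that do not have exactly two fields are skipped.
    ListBackedFunctionSketch(List<String> lines) {
        for (String line : lines) {
            String[] parts = line.split("\t");
            if (parts.length == 2) {
                examples.put(parts[0].trim(), parts[1].trim());
            }
        }
    }

    // Decides whether a pair of words is an example, a counterexample
    // or a candidate, using only the list of example pairs.
    Verdict classify(String first, String second) {
        String known = examples.get(first);
        if (known == null) {
            return Verdict.CANDIDATE;       // nothing known about the first word
        }
        return known.equals(second)
            ? Verdict.EXAMPLE               // the pair is in the list
            : Verdict.COUNTEREXAMPLE;       // contradicts a known pair
    }
}
```

The lines could come from a text file, for instance via java.nio.file.Files.readAllLines. For a relation that is not many-to-one, the counterexample rule would have to be replaced by one that fits that relation.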
The following relations ship with LEILA:
InstanceOf.java (extends Relation.java) is the relation between an entity and its concept. The example pairs come from WordNet and are included in the distribution.
Synonymy.java (extends Relation.java) is the relation between synonymous words. The example pairs come from WordNet and are included in the distribution.
Headquarters.java (extends Relation.java) is the relation between a company and the city of its headquarters. The example pairs are not included in the distribution, because they depend on the corpus, which is copyright-restricted.
Birthdates.java (extends Relation.java) is the relation between a person and her birth date. The example pairs are not included in the distribution due to copyright restrictions.
SimpleFunction.java (extends Relation.java) is a many-to-one relation for demonstration purposes. The example pairs are included in the distribution.
StupidRelation.java (extends Relation.java) is a relation of just one pair for debugging purposes. The example pair is hard-coded in the source.