leila.parsing
Class HTML2LGI

java.lang.Object
  extended by leila.parsing.HTML2LGI

public class HTML2LGI
extends java.lang.Object

This class is part of LEILA (http://mpii.de/yago-naga/leila). It is licensed under the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0) by the author Fabian M. Suchanek (http://suchanek.name).

HTML2LGI takes HTML-files and produces LGI-files (Link Grammar Input files) and LL-files (Linear Linkage files, non-grammatical part).


Field Summary
static  abbreviations
          Abbreviations
static  emphEnd
          HTML-tags that finish the emphasis of a string
static  emphStart
          HTML-tags that emphasize a string
static java.lang.String ignoreChars
          Characters to be ignored
static  ignores
          HTML-tags to be ignored
protected static HTMLReader in
          Readers and Writers
protected static java.io.Writer lgiout
           
protected static java.io.Writer llout
           
static int MAXJOIN
          Maximal length of a word; must be below 60, the longest word that the link grammar parser can handle
protected static int MAXLLLENGTH
          Maximal linear linkage length (in chars)
protected static int MINLGILENGTH
          Minimal sentence length (in chars)
protected static int MINLLLENGTH
          Minimal linear linkage length (in chars)
static  skips
          HTML-tags to be skipped
static java.lang.String stopChars
          Characters that count as sentence delimiters
static  stops
          HTML-tags that count as sentence delimiters
 
Constructor Summary
HTML2LGI()
           
 
Method Summary
protected static void flush(java.lang.String headline, java.lang.StringBuilder s)
          Flushes data, resets s
protected static void flushLGI(java.lang.String s)
          Flushes data to the LGI-file
protected static void flushLL(java.lang.String s)
          Flushes data to the LL-file
static void main(java.io.File f)
          Translates HTML-files to an LGI-file (for calls from Java code)
static void main(java.lang.String[] argv)
          Translates HTML-files to an LGI-file (for calls by the user, i.e. the command-line entry point)
protected static void parseFile(java.io.File f)
          Parses an HTML-file
protected static void parseText(java.lang.String headline)
          Parses a text under a given headline
protected static java.lang.StringBuilder readSentence(java.lang.String[] delim)
          Returns a sequence of unproblematic characters and its delimiter
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

MAXJOIN

public static final int MAXJOIN
Maximal length of a word; must be below 60, the longest word that the link grammar parser can handle

See Also:
Constant Field Values
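
The length constraint can be illustrated with a minimal sketch. The value 50 below is an assumption chosen only for illustration (this page states that the constant lies below 60, not what it is), and the class and method names are hypothetical. What HTML2LGI actually does with an overlong word is not documented here, so the sketch only performs the check.

```java
final class WordLengthSketch {
    // Assumption: stand-in for HTML2LGI.MAXJOIN, whose real value is not shown on this page.
    static final int ASSUMED_MAXJOIN = 50;

    /** Returns true if the word is no longer than the assumed limit. */
    static boolean fitsLimit(String word) {
        return word.length() <= ASSUMED_MAXJOIN;
    }

    public static void main(String[] args) {
        String shortWord = "grammar";
        String longWord = "x".repeat(75); // requires Java 11 or later
        System.out.println(fitsLimit(shortWord)); // prints true
        System.out.println(fitsLimit(longWord));  // prints false
    }
}
```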

in

protected static HTMLReader in
Readers and Writers


lgiout

protected static java.io.Writer lgiout

llout

protected static java.io.Writer llout

MINLGILENGTH

protected static final int MINLGILENGTH
Minimal sentence length (in chars)

See Also:
Constant Field Values

MINLLLENGTH

protected static final int MINLLLENGTH
Minimal linear linkage length (in chars)

See Also:
Constant Field Values

MAXLLLENGTH

protected static final int MAXLLLENGTH
Maximal linear linkage length (in chars)

See Also:
Constant Field Values

abbreviations

public static  abbreviations
Abbreviations


ignoreChars

public static final java.lang.String ignoreChars
Characters to be ignored

See Also:
Constant Field Values

ignores

public static  ignores
HTML-tags to be ignored


skips

public static  skips
HTML-tags to be skipped


stopChars

public static final java.lang.String stopChars
Characters that count as sentence delimiters

See Also:
Constant Field Values

stops

public static  stops
HTML-tags that count as sentence delimiters


emphStart

public static final  emphStart
HTML-tags that emphasize a string


emphEnd

public static final  emphEnd
HTML-tags that finish the emphasis of a string

Constructor Detail

HTML2LGI

public HTML2LGI()

Method Detail

flushLGI

protected static void flushLGI(java.lang.String s)
                        throws java.lang.Exception
Flushes data to the LGI-file

Throws:
java.lang.Exception

flushLL

protected static void flushLL(java.lang.String s)
                       throws java.lang.Exception
Flushes data to the LL-file

Throws:
java.lang.Exception

flush

protected static void flush(java.lang.String headline,
                            java.lang.StringBuilder s)
                     throws java.lang.Exception
Flushes data, resets s

Throws:
java.lang.Exception

readSentence

protected static java.lang.StringBuilder readSentence(java.lang.String[] delim)
                                               throws java.lang.Exception
Returns a sequence of unproblematic characters and its delimiter

Throws:
java.lang.Exception
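
The signature suggests that delim serves as an output parameter for the delimiter that ended the sentence. The following sketch shows that idiom under stated assumptions: the stop characters ".!?" are a guess (the real value of stopChars is not shown on this page), the reader is passed in explicitly instead of using the static field in, and the class name is hypothetical. It is not the actual implementation, which also has to deal with HTML-tags and ignored characters.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class ReadSentenceSketch {
    // Assumption: stand-in for HTML2LGI.stopChars, whose real value is not shown on this page.
    static final String ASSUMED_STOP_CHARS = ".!?";

    /**
     * Reads characters up to the next stop character or the end of input.
     * The delimiter is stored in delim[0]; it is the empty string at end of input.
     */
    static StringBuilder readSentence(Reader in, String[] delim) {
        StringBuilder sentence = new StringBuilder();
        delim[0] = "";
        try {
            int c;
            while ((c = in.read()) != -1) {
                if (ASSUMED_STOP_CHARS.indexOf(c) >= 0) {
                    delim[0] = String.valueOf((char) c);
                    break;
                }
                sentence.append((char) c);
            }
        } catch (IOException e) {
            // Wrapped so that callers of this sketch need no checked-exception handling.
            throw new UncheckedIOException(e);
        }
        return sentence;
    }

    public static void main(String[] args) {
        Reader in = new StringReader("Leila reads HTML. Does it work?");
        String[] delim = new String[1];
        StringBuilder s = readSentence(in, delim);
        System.out.println(s + " | " + delim[0]); // prints "Leila reads HTML | ."
    }
}
```

Returning the text and passing the delimiter back through a one-element array lets a single call report two values without a dedicated result class.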

parseText

protected static void parseText(java.lang.String headline)
                         throws java.lang.Exception
Parses a text under a given headline

Throws:
java.lang.Exception

parseFile

protected static void parseFile(java.io.File f)
                         throws java.lang.Exception
Parses an HTML-file

Throws:
java.lang.Exception

main

public static void main(java.io.File f)
                 throws java.lang.Exception
Translates HTML-files to an LGI-file (for calls from Java code)

Throws:
java.lang.Exception

main

public static void main(java.lang.String[] argv)
                 throws java.lang.Exception
Translates HTML-files to an LGI-file (for calls by the user, i.e. the command-line entry point)

Throws:
java.lang.Exception
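
A call to the File variant from Java code might look like the sketch below. Reflection is used only so that the sketch compiles without LEILA on the classpath; with the library available, a direct call to HTML2LGI.main(File) would do the same. The path "page.html" and the class name InvokeHtml2LgiSketch are hypothetical, and this page does not state whether f may also name a folder or where the output files are written.

```java
import java.io.File;
import java.lang.reflect.Method;

final class InvokeHtml2LgiSketch {
    /**
     * Calls leila.parsing.HTML2LGI.main(File) if the class is on the classpath.
     * Returns false if the class cannot be found. Any other reflective failure,
     * including an exception thrown by the called method, is rethrown wrapped
     * in an IllegalStateException.
     */
    static boolean convert(File input) {
        try {
            Class<?> converter = Class.forName("leila.parsing.HTML2LGI");
            Method main = converter.getMethod("main", File.class);
            main.invoke(null, input); // static method, so no receiver object
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("HTML2LGI.main(File) failed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical input path, used when no argument is given.
        File input = new File(args.length > 0 ? args[0] : "page.html");
        System.out.println(convert(input) ? "converted" : "LEILA not on classpath");
    }
}
```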