leila.parsing
Class HTMLReader

java.lang.Object
  extended by java.io.Reader
      extended by java.io.FilterReader
          extended by leila.parsing.HTMLReader
All Implemented Interfaces:
java.io.Closeable, java.lang.Readable

public class HTMLReader
extends java.io.FilterReader

This class is part of LEILA (http://mpii.de/yago-naga/leila). It is licensed under the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0) by the author Fabian M. Suchanek (http://suchanek.name).

The HTMLReader reads normalized characters from an HTML file. See Char.java for the definition of "normalized". Tags are returned as "TAG_tagname".
Example:

         // HTMLReader is a java.io.Closeable, so try-with-resources closes the file.
         try (HTMLReader r = new HTMLReader(new FileReader("index.html"))) {
             String s;
             while ((s = r.readString()) != null) System.out.print(s);
         }

         -->
             This is the HTML-file, with resolved ampersand sequences
             and with all characters normalized, even the umlauts ae,
             oe, ue. Tags appear like TAG_I this TAG_/I.
   


Field Summary
protected  java.lang.String internalBuf
           
protected  java.lang.String tagContent
          Holds the content of the last tag
protected  boolean wasWhiteSpace
          State kept between calls to readString()
 
Fields inherited from class java.io.FilterReader
in
 
Fields inherited from class java.io.Reader
lock
 
Constructor Summary
HTMLReader(java.io.File f)
          Constructs an HTMLReader from a File
HTMLReader(java.io.Reader s)
          Constructs an HTMLReader from a Reader
HTMLReader(java.net.URL url)
          Constructs an HTMLReader for a URL
 
Method Summary
 java.lang.String getLastTagContent()
          Returns the content of the last tag
static void main(java.lang.String[] argv)
          Test routine
 int read()
          Returns a single character.
 int read(java.nio.CharBuffer buffi)
          Reads into a CharBuffer
 java.lang.String readString()
          Reads a character; returns null at end of file and "TAG_tagname" for tags.
 java.lang.String readTaggedText(java.lang.String t)
          Seeks the next tag of name t and returns all text up to the terminating tag /t.
 java.lang.String readText(int n)
          Reads a sequence of characters up to the blank following the nth character; tags are ignored.
 java.lang.String readTextChar()
          Reads a character, ignoring tags.
 boolean scrollTo(java.lang.String s)
          Seeks a specific string and scrolls to it; returns true if found.
 
Methods inherited from class java.io.FilterReader
close, mark, markSupported, read, ready, reset, skip
 
Methods inherited from class java.io.Reader
read
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

internalBuf

protected java.lang.String internalBuf

tagContent

protected java.lang.String tagContent
Holds the content of the last tag


wasWhiteSpace

protected boolean wasWhiteSpace
State kept between calls to readString()

Constructor Detail

HTMLReader

public HTMLReader(java.io.Reader s)
Constructs an HTMLReader from a Reader


HTMLReader

public HTMLReader(java.net.URL url)
           throws java.io.IOException
Constructs an HTMLReader for a URL

Throws:
java.io.IOException

HTMLReader

public HTMLReader(java.io.File f)
           throws java.io.FileNotFoundException
Constructs an HTMLReader from a File

Throws:
java.io.FileNotFoundException

Method Detail

readText

public java.lang.String readText(int n)
                          throws java.io.IOException
Reads a sequence of characters up to the blank following the nth character; tags are ignored.

Throws:
java.io.IOException
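
The "blank following the nth character" rule can be illustrated with a stand-alone sketch. This is not LEILA's code: the class ReadTextSketch and its static method are hypothetical, the input is assumed to be tag-free already, and two details the documentation leaves open are chosen here as assumptions (the terminating blank is consumed but not returned, and null is returned at end of input).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class ReadTextSketch {
    /**
     * Collects at least n characters, then keeps reading until the next blank.
     * Returns null if the reader is already at end of input (assumption).
     */
    static String readText(Reader in, int n) throws IOException {
        StringBuilder out = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            // Stop at the first blank seen once n characters are collected.
            if (out.length() >= n && c == ' ') break;
            out.append((char) c);
        }
        return out.length() == 0 && c == -1 ? null : out.toString();
    }

    public static void main(String[] args) throws IOException {
        Reader in = new StringReader("The quick brown fox jumps");
        System.out.println(readText(in, 6));  // prints "The quick"
        System.out.println(readText(in, 6));  // prints "brown fox"
    }
}
```

Stopping on a blank rather than at exactly n characters keeps words intact, which is presumably why the method is specified this way.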

readTextChar

public java.lang.String readTextChar()
                              throws java.io.IOException
Reads a character, ignoring tags.

Throws:
java.io.IOException

read

public int read()
         throws java.io.IOException
Returns a single character. Deprecated: use readString() instead, because a single input character (such as an umlaut) may translate to multiple output characters. Tags are not returned.

Overrides:
read in class java.io.FilterReader
Throws:
java.io.IOException
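
The reason for the deprecation can be seen in a small sketch: once one input character may expand to several output characters, an int return value can no longer carry the result. The table below is a hypothetical three-entry stand-in based on the umlaut example in the class description; the actual normalization rules are defined in Char.java and are not reproduced here.

```java
import java.util.Map;

final class NormalizeSketch {
    // Hypothetical subset of a normalization table (assumption, not Char.java).
    private static final Map<Character, String> EXPANSIONS =
        Map.of('\u00e4', "ae", '\u00f6', "oe", '\u00fc', "ue");

    /** Maps one input character to its (possibly longer) normalized form. */
    static String normalize(char c) {
        return EXPANSIONS.getOrDefault(c, String.valueOf(c));
    }

    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        for (char c : "T\u00fcr".toCharArray()) out.append(normalize(c));
        System.out.println(out);  // prints "Tuer"
    }
}
```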

read

public int read(java.nio.CharBuffer buffi)
         throws java.io.IOException
Reads into a CharBuffer

Specified by:
read in interface java.lang.Readable
Overrides:
read in class java.io.Reader
Throws:
java.io.IOException

getLastTagContent

public java.lang.String getLastTagContent()
Returns the content of the last tag


readString

public java.lang.String readString()
                            throws java.io.IOException
Reads a character; returns null at end of file and "TAG_tagname" for tags. Any whitespace is returned as one blank.

Throws:
java.io.IOException
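
The contract above can be approximated by a stand-alone sketch. It is not the LEILA implementation: the class ReadStringSketch is hypothetical, character normalization is left out, and two points are assumptions drawn from the class-level example and the wasWhiteSpace field (tag names are upper-cased, and a run of whitespace yields a single blank).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Locale;

final class ReadStringSketch {
    private final Reader in;
    private boolean wasWhiteSpace = false;  // true if the last unit returned was a blank

    ReadStringSketch(Reader in) { this.in = in; }

    /** Returns "TAG_NAME", a single blank, one character, or null at end of input. */
    String readString() throws IOException {
        int c;
        while ((c = in.read()) != -1) {
            if (Character.isWhitespace(c)) {
                if (wasWhiteSpace) continue;  // swallow the rest of a whitespace run
                wasWhiteSpace = true;
                return " ";
            }
            wasWhiteSpace = false;
            if (c != '<') return String.valueOf((char) c);
            // Collect everything up to '>' and keep only the tag name.
            StringBuilder tag = new StringBuilder();
            while ((c = in.read()) != -1 && c != '>') tag.append((char) c);
            String name = tag.toString().trim().split("\\s+")[0];
            return "TAG_" + name.toUpperCase(Locale.ROOT);
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        ReadStringSketch r = new ReadStringSketch(new StringReader("a <i>b \n c</i>"));
        StringBuilder out = new StringBuilder();
        for (String s; (s = r.readString()) != null; ) out.append(s);
        System.out.println(out);  // prints "a TAG_Ib cTAG_/I"
    }
}
```

Returning a String rather than an int is what lets one call deliver a whole tag token or a multi-character normalized form.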

readTaggedText

public java.lang.String readTaggedText(java.lang.String t)
                                throws java.io.IOException
Seeks the next tag of name t and returns all text up to the terminating tag /t. Nesting is not supported. Returns null if t was not found. The resulting text is normalized.

Throws:
java.io.IOException
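
What "nesting is not supported" means in practice can be shown with a stand-alone sketch on a String. The class TaggedTextSketch is hypothetical and is not how LEILA implements the method: it uses a regular expression, skips normalization, matches tag names case-insensitively, and returns null when the closing tag is missing (all assumptions).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TaggedTextSketch {
    /** Returns the text between the next <t ...> and the first following </t>, or null. */
    static String readTaggedText(String html, String t) {
        String name = Pattern.quote(t);
        Pattern p = Pattern.compile(
            "<" + name + "(\\s[^>]*)?>(.*?)</" + name + "\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = p.matcher(html);
        // The reluctant (.*?) stops at the first closing tag, so nesting is ignored.
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String html = "<p>intro</p><b>one <b>two</b> three</b>";
        System.out.println(readTaggedText(html, "p"));  // prints "intro"
        System.out.println(readTaggedText(html, "b"));  // prints "one <b>two"
        System.out.println(readTaggedText(html, "i"));  // prints "null"
    }
}
```

The second call shows the limitation: the inner closing tag ends the match, so the outer element's remaining text is cut off.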

scrollTo

public boolean scrollTo(java.lang.String s)
                 throws java.io.IOException
Seeks a specific string and scrolls to it; returns true if found.

Throws:
java.io.IOException
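
Seeking a string in a stream can be sketched with a sliding window over the most recently read characters. The class ScrollToSketch below is a hypothetical stand-alone illustration, not LEILA's code. It assumes that the reader ends up positioned just after the match and that matching is done on raw rather than normalized characters; the documentation does not specify either point.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class ScrollToSketch {
    /** Consumes characters until s has just been read; returns false if input ends first. */
    static boolean scrollTo(Reader in, String s) throws IOException {
        if (s.isEmpty()) return true;
        StringBuilder window = new StringBuilder();  // last s.length() characters read
        int c;
        while ((c = in.read()) != -1) {
            window.append((char) c);
            if (window.length() > s.length()) window.deleteCharAt(0);
            if (s.contentEquals(window)) return true;
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        Reader in = new StringReader("<h1>News</h1><p>Body text</p>");
        System.out.println(scrollTo(in, "<p>"));   // prints "true"
        System.out.println((char) in.read());      // prints "B"
        System.out.println(scrollTo(in, "<h1>"));  // prints "false"
    }
}
```

Because the search consumes the stream, a failed search leaves the reader at end of input, as the last call shows.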

main

public static void main(java.lang.String[] argv)
                 throws java.lang.Exception
Test routine

Throws:
java.lang.Exception