sofie.parsing
Class Tokenizer.HTMLTokenizer

java.lang.Object
  extended by sofie.parsing.Tokenizer
      extended by sofie.parsing.Tokenizer.HTMLTokenizer
Enclosing class:
Tokenizer

public static class Tokenizer.HTMLTokenizer
extends Tokenizer

Tokenizer for HTML documents.


Nested Class Summary
 
Nested classes/interfaces inherited from class sofie.parsing.Tokenizer
Tokenizer.HTMLTokenizer, Tokenizer.PatternRunner, Tokenizer.WikiTokenizer
 
Field Summary
protected static java.util.List<Tokenizer.PatternRunner> htmlPatternRunners
          HTML-specific patterns
 
Fields inherited from class sofie.parsing.Tokenizer
standardPatternRunners
 
Constructor Summary
Tokenizer.HTMLTokenizer()
           
 
Method Summary
 java.lang.CharSequence load(java.io.File f)
          Loads the file and decodes it as UTF-8.
protected  java.util.List<Tokenizer.PatternRunner> patternRunners()
          Pattern runners that are to be run for this specific type of tokenizer.
protected  java.lang.String prepare(java.lang.CharSequence s)
          Also decodes HTML ampersand sequences.
 
Methods inherited from class sofie.parsing.Tokenizer
autodetectTokenize, canonicalizeKnownWord, forFile, main, tokenize, tokenize, tokenizeHTML, tokenizeString, tokenizeText, tokenizeWikipedia, tokenizeWikipediaNoInfobox, tokenizeWikipediaOnlyInfobox
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

htmlPatternRunners

protected static final java.util.List<Tokenizer.PatternRunner> htmlPatternRunners
HTML-specific patterns

Constructor Detail

Tokenizer.HTMLTokenizer

public Tokenizer.HTMLTokenizer()

Method Detail

patternRunners

protected java.util.List<Tokenizer.PatternRunner> patternRunners()
Description copied from class: Tokenizer
Pattern runners that are to be run for this specific type of tokenizer. Default: standardPatternRunners

Overrides:
patternRunners in class Tokenizer
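
The override relationship described here can be illustrated with a minimal, self-contained sketch. It does not use the sofie.parsing classes: SketchTokenizer, SketchHtmlTokenizer, Runner, and matchingLabels are hypothetical stand-ins, and the regular expressions are invented for illustration. Whether the real htmlPatternRunners list also contains the standard runners is not stated on this page; the sketch simply assumes it does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a base tokenizer; NOT sofie.parsing.Tokenizer.
class SketchTokenizer {

    /** Hypothetical stand-in for a pattern runner: a label plus a compiled pattern. */
    static class Runner {
        final String label;
        final Pattern pattern;

        Runner(String label, String regex) {
            this.label = label;
            this.pattern = Pattern.compile(regex);
        }
    }

    /** Runners shared by all tokenizer types (illustrative patterns only). */
    protected static final List<Runner> STANDARD_RUNNERS = Collections.unmodifiableList(
            Arrays.asList(new Runner("number", "\\d+"), new Runner("word", "\\p{L}+")));

    /** Default: the standard runners. Subclasses override this to supply their own list. */
    protected List<Runner> patternRunners() {
        return STANDARD_RUNNERS;
    }

    /** Returns the labels of all runners whose pattern occurs somewhere in the text. */
    List<String> matchingLabels(CharSequence text) {
        List<String> labels = new ArrayList<>();
        for (Runner r : patternRunners()) {
            if (r.pattern.matcher(text).find()) {
                labels.add(r.label);
            }
        }
        return labels;
    }
}

// Hypothetical HTML variant: adds a tag pattern in front of the standard runners.
class SketchHtmlTokenizer extends SketchTokenizer {

    /** Assumption: HTML-specific runners first, then the standard ones. */
    protected static final List<Runner> HTML_RUNNERS;
    static {
        List<Runner> all = new ArrayList<>();
        all.add(new Runner("tag", "</?[a-zA-Z][^>]*>"));
        all.addAll(STANDARD_RUNNERS);
        HTML_RUNNERS = Collections.unmodifiableList(all);
    }

    @Override
    protected List<Runner> patternRunners() {
        return HTML_RUNNERS;
    }
}
```

Because matchingLabels calls patternRunners() rather than reading a field directly, the subclass changes which patterns run without touching the matching loop. This is one plausible reason for exposing the list through an overridable method.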

prepare

protected java.lang.String prepare(java.lang.CharSequence s)
Also decodes HTML ampersand sequences.

Overrides:
prepare in class Tokenizer
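
As a rough illustration of what decoding ampersand sequences can involve, the following sketch replaces a handful of named entities and numeric character references in a single pass. It is not the implementation behind prepare: the class AmpersandSketch, its decode method, and the small entity table are assumptions made for this example, and the real method may cover more entities or behave differently on malformed input.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; NOT part of sofie.parsing.
class AmpersandSketch {

    /** Small illustrative table of named entities (a real table would be far longer). */
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
    }

    /** Matches &#123; (decimal), &#x7B; (hexadecimal), or &name; references. */
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z]+);");

    /** Converts digits in the given radix to a character, or returns the fallback if invalid. */
    private static String fromCodePoint(String digits, int radix, String fallback) {
        try {
            int cp = Integer.parseInt(digits, radix);
            return Character.isValidCodePoint(cp) ? new String(Character.toChars(cp)) : fallback;
        } catch (NumberFormatException e) {
            return fallback; // number too large for an int
        }
    }

    /** Decodes known references once, left to right; unknown ones are left unchanged. */
    static String decode(CharSequence s) {
        Matcher m = ENTITY.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            if (body.startsWith("#x") || body.startsWith("#X")) {
                replacement = fromCodePoint(body.substring(2), 16, m.group());
            } else if (body.startsWith("#")) {
                replacement = fromCodePoint(body.substring(1), 10, m.group());
            } else {
                String named = NAMED.get(body);
                replacement = named != null ? named : m.group();
            }
            // quoteReplacement keeps '$' and '\' in the replacement literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The sketch decodes in one pass, so a doubly escaped input such as `&amp;lt;` becomes `&lt;` rather than `<`.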

load

public java.lang.CharSequence load(java.io.File f)
                            throws java.io.IOException
Loads the file and decodes it as UTF-8.

Overrides:
load in class Tokenizer
Throws:
java.io.IOException
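
For readers unfamiliar with explicit charset handling, here is a minimal sketch of reading a file and decoding its bytes as UTF-8 using only the standard library. Utf8LoadSketch is a hypothetical class written for this example; it mirrors the signature of load but makes no claim about how the actual method reads or buffers the file.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper; NOT part of sofie.parsing.
class Utf8LoadSketch {

    /**
     * Reads the whole file into memory and decodes the bytes as UTF-8,
     * independent of the platform default charset.
     */
    static CharSequence load(File f) throws IOException {
        byte[] bytes = Files.readAllBytes(f.toPath());
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Naming the charset explicitly matters because the same bytes decode differently under a platform default such as ISO-8859-1. Reading the whole file at once is simple but assumes the file fits comfortably in memory.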