sofie.parsing
Class Tokenizer.WikiTokenizer

java.lang.Object
  extended by sofie.parsing.Tokenizer
      extended by sofie.parsing.Tokenizer.WikiTokenizer
Enclosing class:
Tokenizer

public static class Tokenizer.WikiTokenizer
extends Tokenizer

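Tokenizer for Wikipedia articles. It can tokenize the whole article, the article without its infoboxes, or the infoboxes alone.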


Nested Class Summary
 
Nested classes/interfaces inherited from class sofie.parsing.Tokenizer
Tokenizer.HTMLTokenizer, Tokenizer.PatternRunner, Tokenizer.WikiTokenizer
 
Field Summary
protected static Tokenizer.PatternRunner infoboxKiller
          A pattern runner that eliminates all infoboxes
protected static Tokenizer.PatternRunner infoboxSaver
          A pattern runner that eliminates everything except infoboxes
protected static java.util.List<Tokenizer.PatternRunner> wikiPatternRunners
          Wikipedia-specific patterns
 
Fields inherited from class sofie.parsing.Tokenizer
standardPatternRunners
 
Constructor Summary
Tokenizer.WikiTokenizer()
           
 
Method Summary
 java.lang.CharSequence load(java.io.File f)
          Loads the file and decodes its content as UTF-8.
protected  java.util.List<Tokenizer.PatternRunner> patternRunners()
          Pattern runners that are to be run for this specific type of tokenizer.
protected  java.lang.String prepare(java.lang.CharSequence s)
          Decodes ampersand escape sequences (such as &amp;) in the input.
 java.util.List<Token> tokenizeNoInfobox(java.lang.CharSequence s)
          Tokenizes the Wikipedia article given as a character sequence, after removing its infoboxes.
 java.util.List<Token> tokenizeNoInfobox(java.io.File f)
          Tokenizes the Wikipedia article read from a file, after removing its infoboxes.
 java.util.List<Token> tokenizeOnlyInfobox(java.lang.CharSequence s)
          Tokenizes only the infoboxes of the Wikipedia article given as a character sequence; everything outside the infoboxes is removed.
 java.util.List<Token> tokenizeOnlyInfobox(java.io.File f)
          Tokenizes only the infoboxes of the Wikipedia article read from a file; everything outside the infoboxes is removed.
 
Methods inherited from class sofie.parsing.Tokenizer
autodetectTokenize, canonicalizeKnownWord, forFile, main, tokenize, tokenize, tokenizeHTML, tokenizeString, tokenizeText, tokenizeWikipedia, tokenizeWikipediaNoInfobox, tokenizeWikipediaOnlyInfobox
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

wikiPatternRunners

protected static final java.util.List<Tokenizer.PatternRunner> wikiPatternRunners
Wikipedia-specific patterns


infoboxKiller

protected static final Tokenizer.PatternRunner infoboxKiller
A pattern runner that eliminates all infoboxes


infoboxSaver

protected static final Tokenizer.PatternRunner infoboxSaver
A pattern runner that eliminates everything except infoboxes

Constructor Detail

Tokenizer.WikiTokenizer

public Tokenizer.WikiTokenizer()
Method Detail

patternRunners

protected java.util.List<Tokenizer.PatternRunner> patternRunners()
Description copied from class: Tokenizer
Pattern runners that are to be run for this specific type of tokenizer. Default: standardPatternRunners

Overrides:
patternRunners in class Tokenizer

prepare

protected java.lang.String prepare(java.lang.CharSequence s)
Decodes ampersand escape sequences (such as &amp;) in the input.

Overrides:
prepare in class Tokenizer
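
The description of prepare is terse, so the following is a minimal sketch of what decoding ampersand sequences could look like. It is not the library's implementation: the class name AmpersandDecodeSketch, the small entity table, and the choice to leave unknown entities untouched are all assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the decoding step; not part of sofie.parsing.
final class AmpersandDecodeSketch {
    // Matches &name;, &#123; and &#x7B; style sequences.
    private static final Pattern ENTITY =
        Pattern.compile("&(#[xX][0-9a-fA-F]{1,6}|#[0-9]{1,7}|[A-Za-z][A-Za-z0-9]*);");

    // Deliberately small table (assumption); a real decoder likely knows many more names.
    private static String named(String name) {
        switch (name) {
            case "amp": return "&";
            case "lt": return "<";
            case "gt": return ">";
            case "quot": return "\"";
            case "apos": return "'";
            default: return null;
        }
    }

    // Converts a numeric reference to text, or returns null if it is not a valid code point.
    private static String numeric(String digits, int radix) {
        int codePoint = Integer.parseInt(digits, radix);
        return Character.isValidCodePoint(codePoint)
            ? new String(Character.toChars(codePoint))
            : null;
    }

    // Decodes every recognized sequence in a single pass; unknown ones are kept verbatim.
    static String decode(CharSequence s) {
        Matcher m = ENTITY.matcher(s);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(s, last, m.start());
            String body = m.group(1);
            String replacement;
            if (body.charAt(0) != '#') {
                replacement = named(body);
            } else if (body.charAt(1) == 'x' || body.charAt(1) == 'X') {
                replacement = numeric(body.substring(2), 16);
            } else {
                replacement = numeric(body.substring(1), 10);
            }
            out.append(replacement != null ? replacement : m.group());
            last = m.end();
        }
        out.append(s, last, s.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("Tom &amp; Jerry &lt;3 &#65;&#x42;")); // prints "Tom & Jerry <3 AB"
    }
}
```

Because the sketch scans the original input once, a doubly escaped sequence such as &amp;lt; is decoded only one level, to &lt;.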

load

public java.lang.CharSequence load(java.io.File f)
                            throws java.io.IOException
Loads the file and decodes its content as UTF-8.

Overrides:
load in class Tokenizer
Throws:
java.io.IOException
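
For orientation, reading a file and decoding it as UTF-8 can be sketched with the standard library alone. The class Utf8LoadSketch is hypothetical, and the actual load method may decode differently (for example, with its own UTF-8 routine); only the signature mirrors the one documented above.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical stand-in for load(File); not part of sofie.parsing.
final class Utf8LoadSketch {
    // Reads the whole file into memory and interprets the bytes as UTF-8.
    static CharSequence load(File f) throws IOException {
        byte[] bytes = Files.readAllBytes(f.toPath());
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            System.out.println(load(new File(args[0])));
        }
    }
}
```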

tokenizeNoInfobox

public java.util.List<Token> tokenizeNoInfobox(java.lang.CharSequence s)
Tokenizes the Wikipedia article given as a character sequence, after removing its infoboxes.


tokenizeOnlyInfobox

public java.util.List<Token> tokenizeOnlyInfobox(java.lang.CharSequence s)
Tokenizes only the infoboxes of the Wikipedia article given as a character sequence; everything outside the infoboxes is removed.
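
tokenizeNoInfobox and tokenizeOnlyInfobox are complementary: one drops the infoboxes, the other keeps nothing else. The sketch below illustrates that split on raw wiki markup by matching nested {{ ... }} braces. It is an assumption about how such a split might work, not the logic of infoboxKiller or infoboxSaver; the class InfoboxSplitSketch and all its method names are hypothetical, and no tokenization is performed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of splitting wiki markup into infobox and non-infobox text.
final class InfoboxSplitSketch {
    private static final String OPEN = "{{infobox";

    // Case-insensitive indexOf that never shifts indices (unlike lowercasing the text first).
    private static int indexOfIgnoreCase(String text, String needle, int from) {
        for (int i = from; i + needle.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) return i;
        }
        return -1;
    }

    // Returns the index just past the "}}" closing the template that opens at start, or -1.
    private static int templateEnd(String s, int start) {
        int depth = 0;
        int i = start;
        while (i + 1 < s.length()) {
            if (s.charAt(i) == '{' && s.charAt(i + 1) == '{') {
                depth++;
                i += 2;
            } else if (s.charAt(i) == '}' && s.charAt(i + 1) == '}') {
                depth--;
                i += 2;
                if (depth == 0) return i;
            } else {
                i++;
            }
        }
        return -1; // unbalanced braces
    }

    // Finds [start, end) spans of all balanced infobox templates, in document order.
    private static List<int[]> infoboxSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = indexOfIgnoreCase(text, OPEN, pos);
            if (start < 0) break;
            int end = templateEnd(text, start);
            if (end < 0) break; // treat an unbalanced infobox as ordinary text
            spans.add(new int[] {start, end});
            pos = end;
        }
        return spans;
    }

    // Article text with every infobox removed.
    static String stripInfoboxes(CharSequence s) {
        String text = s.toString();
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (int[] span : infoboxSpans(text)) {
            out.append(text, pos, span[0]);
            pos = span[1];
        }
        return out.append(text, pos, text.length()).toString();
    }

    // Only the infoboxes, concatenated in document order.
    static String keepInfoboxes(CharSequence s) {
        String text = s.toString();
        StringBuilder out = new StringBuilder();
        for (int[] span : infoboxSpans(text)) {
            out.append(text, span[0], span[1]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String article = "{{Infobox person|name={{small|Ada}}}}Ada was a mathematician.";
        System.out.println(stripInfoboxes(article)); // prints "Ada was a mathematician."
        System.out.println(keepInfoboxes(article));  // prints "{{Infobox person|name={{small|Ada}}}}"
    }
}
```

Counting brace depth rather than using a single regular expression is a deliberate choice here, since infoboxes routinely contain nested templates that a non-recursive pattern would cut short.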


tokenizeNoInfobox

public java.util.List<Token> tokenizeNoInfobox(java.io.File f)
                                        throws java.io.IOException
Tokenizes the Wikipedia article read from a file, after removing its infoboxes.

Throws:
java.io.IOException

tokenizeOnlyInfobox

public java.util.List<Token> tokenizeOnlyInfobox(java.io.File f)
                                          throws java.io.IOException
Tokenizes only the infoboxes of the Wikipedia article read from a file; everything outside the infoboxes is removed.

Throws:
java.io.IOException