sofie.parsing
Class Tokenizer

java.lang.Object
  extended by sofie.parsing.Tokenizer
Direct Known Subclasses:
Tokenizer.HTMLTokenizer, Tokenizer.WikiTokenizer

public class Tokenizer
extends java.lang.Object

Tokenizes a String.

This class is part of the SOFIE system (http://mpii.de/yago-naga/sofie). It is licensed under the Creative Commons Attribution-Noncommercial-Share-Alike 3.0 Unported License (http://creativecommons.org/licenses/by-nc-sa/3.0/) by Fabian M. Suchanek (http://suchanek.name).

If you use this class for scientific purposes, please cite: Fabian M. Suchanek, Mauro Sozio, Gerhard Weikum, "SOFIE: A Self-Organizing Framework for Information Extraction" (International World Wide Web Conference 2009).


Nested Class Summary
static class Tokenizer.HTMLTokenizer
          Tokenizer for HTML
protected static class Tokenizer.PatternRunner
          A pattern runner is an object that finds one specific pattern in a string.
static class Tokenizer.WikiTokenizer
          Tokenizer for Wikipedia
 
Field Summary
protected static java.util.List<Tokenizer.PatternRunner> standardPatternRunners
          Standard Pattern runners
 
Constructor Summary
Tokenizer()
           
 
Method Summary
static java.util.List<Token> autodetectTokenize(java.io.File f)
          Tokenizes a file, detecting its type automatically
static Token canonicalizeKnownWord(Token s)
          Returns a token for a predefined word
static Tokenizer forFile(java.io.File f)
          Returns a tokenizer for a file
 java.lang.CharSequence load(java.io.File f)
          Loads the file into memory
static void main(java.lang.String[] args)
          Test
protected  java.util.List<Tokenizer.PatternRunner> patternRunners()
          Pattern runners that are to be run for this specific type of tokenizer.
protected  java.lang.String prepare(java.lang.CharSequence s)
          Prepares a string for tokenizing.
 java.util.List<Token> tokenize(java.lang.CharSequence s)
          Main method: Tokenizes a char sequence
 java.util.List<Token> tokenize(java.io.File f)
          Tokenizes a file
static java.util.List<Token> tokenizeHTML(java.io.File f)
          Tokenizes an HTML document
static java.util.List<Token> tokenizeString(java.lang.String text)
          Tokenizes a string
static java.util.List<Token> tokenizeText(java.io.File f)
          Tokenizes a text document
static java.util.List<Token> tokenizeWikipedia(java.io.File f)
          Tokenizes a Wikipedia article completely
static java.util.List<Token> tokenizeWikipediaNoInfobox(java.io.File f)
          Tokenizes a Wikipedia article, omitting the infoboxes
static java.util.List<Token> tokenizeWikipediaOnlyInfobox(java.io.File f)
          Tokenizes the infobox of a Wikipedia article
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

standardPatternRunners

protected static final java.util.List<Tokenizer.PatternRunner> standardPatternRunners
Standard Pattern runners

Constructor Detail

Tokenizer

public Tokenizer()

Method Detail

canonicalizeKnownWord

public static Token canonicalizeKnownWord(Token s)
Returns a token for a predefined word


patternRunners

protected java.util.List<Tokenizer.PatternRunner> patternRunners()
Pattern runners that are to be run for this specific type of tokenizer. Default: standardPatternRunners
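The relationship between a tokenizer and its pattern runners can be illustrated with a small standalone sketch. It does not use the SOFIE classes: Runner, BaseTokenizer, TagAwareTokenizer and the regular expressions are hypothetical names chosen for illustration, and the real Tokenizer.PatternRunner may work differently. The sketch only shows the general idea of a base class that returns a standard list of runners and a subclass that overrides patternRunners() to extend that list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; these names are not taken from the SOFIE source.
class PatternRunnerSketch {

    /** Finds one specific pattern in a string and labels each match. */
    static class Runner {
        final String label;
        final Pattern pattern;

        Runner(String label, String regex) {
            this.label = label;
            this.pattern = Pattern.compile(regex);
        }

        /** Returns "label:match" for every occurrence of the pattern in s. */
        List<String> run(CharSequence s) {
            List<String> found = new ArrayList<>();
            Matcher m = pattern.matcher(s);
            while (m.find()) {
                found.add(label + ":" + m.group());
            }
            return found;
        }
    }

    /** Base tokenizer: subclasses override patternRunners() to change the set. */
    static class BaseTokenizer {
        static final List<Runner> STANDARD = List.of(
                new Runner("NUMBER", "\\d+"),
                new Runner("WORD", "[A-Za-z]+"));

        /** Default: the standard runners. */
        protected List<Runner> patternRunners() {
            return STANDARD;
        }

        /** Applies each runner in turn and collects all labelled matches. */
        List<String> tokenize(CharSequence s) {
            List<String> out = new ArrayList<>();
            for (Runner r : patternRunners()) {
                out.addAll(r.run(s));
            }
            return out;
        }
    }

    /** A specialised tokenizer that adds one runner to the standard ones. */
    static class TagAwareTokenizer extends BaseTokenizer {
        @Override
        protected List<Runner> patternRunners() {
            List<Runner> runners = new ArrayList<>(STANDARD);
            runners.add(new Runner("TAG", "<[^>]+>"));
            return runners;
        }
    }

    public static void main(String[] args) {
        System.out.println(new BaseTokenizer().tokenize("born 1879 in Ulm"));
        System.out.println(new TagAwareTokenizer().tokenize("<b>42</b>"));
    }
}
```

In this sketch the specialised tokenizer copies the standard list before adding to it, so the shared default list is never modified.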


prepare

protected java.lang.String prepare(java.lang.CharSequence s)
Prepares a string for tokenizing. Default: Normalize numbers and dates
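The documentation does not say which normalizations prepare applies. As a rough idea of what normalizing numbers and dates could look like, the following standalone sketch removes digit-group commas and rewrites dates to year-month-day form. Both rules, the class name PrepareSketch, and the assumption that slash-separated dates are month/day/year are illustrative choices, not SOFIE's actual behaviour.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of a normalization step; not SOFIE's actual rules.
class PrepareSketch {

    // A comma between a digit and a group of exactly three digits.
    private static final Pattern THOUSANDS =
            Pattern.compile("(?<=\\d),(?=\\d{3}(?!\\d))");

    // Assumed input date form: month/day/year with a four-digit year.
    private static final Pattern US_DATE =
            Pattern.compile("\\b(\\d{1,2})/(\\d{1,2})/(\\d{4})\\b");

    /** Returns s with digit-group commas removed and dates rewritten as yyyy-MM-dd. */
    static String prepare(CharSequence s) {
        String noSeparators = THOUSANDS.matcher(s).replaceAll("");
        return US_DATE.matcher(noSeparators).replaceAll(m ->
                String.format("%s-%02d-%02d",
                        m.group(3),
                        Integer.parseInt(m.group(1)),
                        Integer.parseInt(m.group(2))));
    }

    public static void main(String[] args) {
        System.out.println(prepare("Population 1,234,567 on 3/7/2009"));
    }
}
```

Normalizing before tokenizing means that later pattern matching only has to recognise one spelling of each number or date.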


tokenize

public java.util.List<Token> tokenize(java.lang.CharSequence s)
Main method: Tokenizes a char sequence


load

public java.lang.CharSequence load(java.io.File f)
                            throws java.io.IOException
Loads the file into memory

Throws:
java.io.IOException
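For readers unfamiliar with this kind of method, a minimal standalone sketch of loading a whole file into memory as a CharSequence follows. It is not the SOFIE implementation: the class name LoadSketch is hypothetical, and the UTF-8 encoding is an assumption, since the documentation does not state which encoding load uses.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical illustration; the encoding used by SOFIE's load is not documented.
class LoadSketch {

    /** Reads the complete file into memory, decoding it as UTF-8 (assumed). */
    static CharSequence load(File f) throws IOException {
        byte[] bytes = Files.readAllBytes(f.toPath());
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Write a small temporary file, then load it back.
        File tmp = File.createTempFile("load-sketch", ".txt");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(), "born 1879 in Ulm".getBytes(StandardCharsets.UTF_8));
        System.out.println(load(tmp));
    }
}
```

As in the documented signature, the IOException is passed on to the caller instead of being handled inside the method.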

tokenize

public java.util.List<Token> tokenize(java.io.File f)
                               throws java.io.IOException
Tokenizes a file

Throws:
java.io.IOException

forFile

public static Tokenizer forFile(java.io.File f)
Returns a tokenizer for a file
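How forFile chooses a tokenizer is not documented here. One plausible approach, shown purely as an assumption, is to decide by file name extension. The sketch below is standalone and does not use the SOFIE classes: ForFileSketch, the Kind enum, and the extensions checked (including ".wiki") are hypothetical.

```java
import java.io.File;
import java.util.Locale;

// Hypothetical illustration; SOFIE's actual selection rule is not documented.
class ForFileSketch {

    /** The kinds of tokenizer this sketch can select. */
    enum Kind { HTML, WIKI, TEXT }

    /** Picks a tokenizer kind from the file name extension (assumed rule). */
    static Kind kindFor(File f) {
        String name = f.getName().toLowerCase(Locale.ROOT);
        if (name.endsWith(".html") || name.endsWith(".htm")) {
            return Kind.HTML;
        }
        if (name.endsWith(".wiki")) {
            return Kind.WIKI;
        }
        // Anything else is treated as plain text.
        return Kind.TEXT;
    }

    public static void main(String[] args) {
        System.out.println(kindFor(new File("article.HTML")));
        System.out.println(kindFor(new File("notes.txt")));
    }
}
```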


tokenizeString

public static java.util.List<Token> tokenizeString(java.lang.String text)
                                            throws java.io.IOException
Tokenizes a string

Throws:
java.io.IOException

autodetectTokenize

public static java.util.List<Token> autodetectTokenize(java.io.File f)
                                                throws java.io.IOException
Tokenizes a file, detecting its type automatically

Throws:
java.io.IOException

tokenizeWikipediaNoInfobox

public static java.util.List<Token> tokenizeWikipediaNoInfobox(java.io.File f)
                                                        throws java.io.IOException
Tokenizes a Wikipedia article, omitting the infoboxes

Throws:
java.io.IOException

tokenizeWikipediaOnlyInfobox

public static java.util.List<Token> tokenizeWikipediaOnlyInfobox(java.io.File f)
                                                          throws java.io.IOException
Tokenizes the infobox of a Wikipedia article

Throws:
java.io.IOException

tokenizeWikipedia

public static java.util.List<Token> tokenizeWikipedia(java.io.File f)
                                               throws java.io.IOException
Tokenizes a Wikipedia article completely

Throws:
java.io.IOException

tokenizeHTML

public static java.util.List<Token> tokenizeHTML(java.io.File f)
                                          throws java.io.IOException
Tokenizes an HTML document

Throws:
java.io.IOException

tokenizeText

public static java.util.List<Token> tokenizeText(java.io.File f)
                                          throws java.io.IOException
Tokenizes a text document

Throws:
java.io.IOException

main

public static void main(java.lang.String[] args)
                 throws java.lang.Exception
Test

Throws:
java.lang.Exception