java.lang.Object
  sofie.parsing.Tokenizer
public class Tokenizer
Class Tokenizer

This class is part of the SOFIE system (http://mpii.de/yago-naga/sofie). It is licensed under the Creative Commons Attribution-Noncommercial-Share-Alike 3.0 Unported License (http://creativecommons.org/licenses/by-nc-sa/3.0/) by Fabian M. Suchanek (http://suchanek.name).

If you use this class for scientific purposes, please cite Fabian M. Suchanek, Mauro Sozio, Gerhard Weikum: "SOFIE: A Self-Organizing Framework for Information Extraction" (International World Wide Web Conference 2009).

Tokenizes a String.
Nested Class Summary | |
---|---|
static class | Tokenizer.HTMLTokenizer: Tokenizer for HTML |
protected static class | Tokenizer.PatternRunner: A pattern runner is an object that finds one specific pattern in a string. |
static class | Tokenizer.WikiTokenizer: Tokenizer for Wikipedia |
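
The idea of a pattern runner, an object that finds one specific pattern in a string, can be illustrated with a minimal sketch built on `java.util.regex`. The class `SimplePatternRunner`, its `label` field, and its `findAll` method are hypothetical names chosen for this illustration; this page does not document the members of `Tokenizer.PatternRunner`, so the sketch should not be read as its actual interface.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for illustration; not the actual Tokenizer.PatternRunner. */
class SimplePatternRunner {
    /** Human-readable name of the pattern (an assumption of this sketch). */
    final String label;
    private final Pattern pattern;

    SimplePatternRunner(String label, String regex) {
        this.label = label;
        this.pattern = Pattern.compile(regex);
    }

    /** Returns every non-overlapping match of this runner's single pattern, in order. */
    List<String> findAll(CharSequence text) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }
}

class PatternRunnerDemo {
    public static void main(String[] args) {
        // One runner is responsible for exactly one pattern, here four-digit numbers.
        SimplePatternRunner years = new SimplePatternRunner("year", "\\b\\d{4}\\b");
        System.out.println(years.findAll("SOFIE appeared at WWW 2009, not 2008."));
    }
}
```

Keeping one pattern per object means a tokenizer can hold a list of such runners and apply them in turn, which matches the `java.util.List<Tokenizer.PatternRunner>` types that appear in the field and method summaries.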
Field Summary | |
---|---|
protected static java.util.List<Tokenizer.PatternRunner> | standardPatternRunners: Standard pattern runners |
Constructor Summary | |
---|---|
Tokenizer() | |
Method Summary | |
---|---|
static java.util.List<Token> | autodetectTokenize(java.io.File f): Tokenizes a file |
static Token | canonicalizeKnownWord(Token s): Returns a token for a predefined word |
static Tokenizer | forFile(java.io.File f): Returns a tokenizer for a file |
java.lang.CharSequence | load(java.io.File f): Loads the file into memory |
static void | main(java.lang.String[] args): Test |
protected java.util.List<Tokenizer.PatternRunner> | patternRunners(): Pattern runners that are to be run for this specific type of tokenizer. |
protected java.lang.String | prepare(java.lang.CharSequence s): Prepares a string for tokenizing. |
java.util.List<Token> | tokenize(java.lang.CharSequence s): Main method: Tokenizes a char sequence |
java.util.List<Token> | tokenize(java.io.File f): Tokenizes a file |
static java.util.List<Token> | tokenizeHTML(java.io.File f): Tokenizes an HTML document |
static java.util.List<Token> | tokenizeString(java.lang.String text): Tokenizes a string |
static java.util.List<Token> | tokenizeText(java.io.File f): Tokenizes a text document |
static java.util.List<Token> | tokenizeWikipedia(java.io.File f): Tokenizes a Wikipedia article completely |
static java.util.List<Token> | tokenizeWikipediaNoInfobox(java.io.File f): Tokenizes a Wikipedia article, omitting the infoboxes |
static java.util.List<Token> | tokenizeWikipediaOnlyInfobox(java.io.File f): Tokenizes the infobox of a Wikipedia article |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
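
This page does not say how `forFile` or `autodetectTokenize` decide which tokenizer to use. One plausible approach, shown here purely as an assumption, is to dispatch on the file extension. The names `DocKind` and `KindDetector`, and the `.wiki` extension, are hypothetical and do not come from the SOFIE sources.

```java
import java.io.File;
import java.util.Locale;

/** Hypothetical document kinds; the real forFile returns a Tokenizer instance instead. */
enum DocKind { HTML, WIKI, TEXT }

class KindDetector {
    /** Guesses a document kind from the file name (an assumed rule, not SOFIE's). */
    static DocKind detect(File f) {
        String name = f.getName().toLowerCase(Locale.ROOT);
        if (name.endsWith(".html") || name.endsWith(".htm")) {
            return DocKind.HTML;
        }
        if (name.endsWith(".wiki")) { // assumed extension for Wikipedia markup
            return DocKind.WIKI;
        }
        return DocKind.TEXT; // fall back to plain text
    }
}
```

In the documented API the equivalent choice would presumably map to `Tokenizer.HTMLTokenizer`, `Tokenizer.WikiTokenizer`, or the plain `Tokenizer`, but that mapping is a guess based on the nested class names alone.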
Field Detail |
---|
protected static final java.util.List<Tokenizer.PatternRunner> standardPatternRunners
Constructor Detail |
---|
public Tokenizer()
Method Detail |
---|
public static Token canonicalizeKnownWord(Token s)
protected java.util.List<Tokenizer.PatternRunner> patternRunners()
protected java.lang.String prepare(java.lang.CharSequence s)
public java.util.List<Token> tokenize(java.lang.CharSequence s)
public java.lang.CharSequence load(java.io.File f) throws java.io.IOException
  Throws: java.io.IOException
public java.util.List<Token> tokenize(java.io.File f) throws java.io.IOException
  Throws: java.io.IOException
public static Tokenizer forFile(java.io.File f)
public static java.util.List<Token> tokenizeString(java.lang.String text) throws java.io.IOException
  Throws: java.io.IOException
public static java.util.List<Token> autodetectTokenize(java.io.File f) throws java.io.IOException
  Throws: java.io.IOException
public static java.util.List<Token> tokenizeWikipediaNoInfobox(java.io.File f) throws java.io.IOException
  Throws: java.io.IOException
public static java.util.List<Token> tokenizeWikipediaOnlyInfobox(java.io.File f) throws java.io.IOException
  Throws: java.io.IOException
public static java.util.List<Token> tokenizeWikipedia(java.io.File f) throws java.io.IOException
  Throws: java.io.IOException
public static java.util.List<Token> tokenizeHTML(java.io.File f) throws java.io.IOException
  Throws: java.io.IOException
public static java.util.List<Token> tokenizeText(java.io.File f) throws java.io.IOException
  Throws: java.io.IOException
public static void main(java.lang.String[] args) throws java.lang.Exception
  Throws: java.lang.Exception
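
The signatures of `load`, `prepare`, and the two `tokenize` overloads suggest a pipeline in which a file is loaded into memory, the text is prepared, and the result is split into tokens. The sketch below shows one way such a pipeline could be wired together. It is an assumption, not the SOFIE implementation: `SketchTokenizer` is a hypothetical class, tokens are plain strings rather than `Token` objects, the file is assumed to be UTF-8, and whitespace splitting stands in for the pattern runners.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in that mirrors the method names of Tokenizer, not its behavior. */
class SketchTokenizer {
    /** Loads the whole file into memory (UTF-8 is an assumption of this sketch). */
    CharSequence load(File f) throws IOException {
        return new String(Files.readAllBytes(f.toPath()), StandardCharsets.UTF_8);
    }

    /** Prepares a string for tokenizing; here it only normalizes whitespace. */
    protected String prepare(CharSequence s) {
        return s.toString().replaceAll("\\s+", " ").trim();
    }

    /** Main method: prepares the input, then splits it into tokens. */
    List<String> tokenize(CharSequence s) {
        String prepared = prepare(s);
        List<String> tokens = new ArrayList<>();
        if (prepared.isEmpty()) {
            return tokens;
        }
        for (String part : prepared.split(" ")) {
            tokens.add(part);
        }
        return tokens;
    }

    /** File overload: delegates to load and then to the CharSequence overload. */
    List<String> tokenize(File f) throws IOException {
        return tokenize(load(f));
    }
}
```

With this layout a subclass only needs to override `prepare` (and, in the documented class, `patternRunners()`) to handle another input format, which would be consistent with `prepare` and `patternRunners()` being declared `protected`.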