java.lang.Object
  java.text.BreakIterator
    java.text.RuleBasedBreakIterator
      java.text.DictionaryBasedBreakIterator
A subclass of RuleBasedBreakIterator that adds the ability to use a dictionary to further subdivide ranges of text beyond what is possible using just the state-table-based algorithm. This is necessary, for example, to handle word and line breaking in Thai, which doesn't use spaces between words. The state-table-based algorithm used by RuleBasedBreakIterator divides up the text as far as possible, and then contiguous ranges of letters are repeatedly compared against a list of known words (i.e., the dictionary) to divide them up into words.

DictionaryBasedBreakIterator uses the same rule language as RuleBasedBreakIterator, but adds one more special substitution name: <dictionary>. This substitution name identifies the characters that occur in words in the dictionary. If the iterator passes over a chunk of text that includes two or more characters in a row that are included in <dictionary>, it goes back through that range and derives additional break positions (if possible) using the dictionary.

DictionaryBasedBreakIterator is also constructed with the filename of a dictionary file. It follows a prescribed search path to locate the dictionary (right now, it looks for it in /com/ibm/text/resources in each directory in the classpath, and won't find it in JAR files, but this location is likely to change). The dictionary file is in a serialized binary format. We have a very primitive (and slow) BuildDictionaryFile utility for creating dictionary files, but aren't currently making it public. Contact us for help.
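The dictionary pass described above can be illustrated with a minimal sketch. It is not the algorithm this class implements: it substitutes a plain greedy longest-match over an in-memory word set for the compressed BreakDictionary, and the class name DictionarySplitSketch and the helper divideRange are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DictionarySplitSketch {
    /**
     * Returns the break positions found inside text[start, end) by greedy
     * longest-match against a set of known words (hypothetical helper; the
     * real class uses its own dictionary format and matching strategy).
     */
    static List<Integer> divideRange(String text, int start, int end, Set<String> words) {
        List<Integer> breaks = new ArrayList<>();
        int pos = start;
        while (pos < end) {
            int next = pos + 1; // fallback when no word matches: advance one character
            // Try the longest candidate first, then shorter ones.
            for (int candidate = end; candidate > pos; candidate--) {
                if (words.contains(text.substring(pos, candidate))) {
                    next = candidate;
                    break;
                }
            }
            breaks.add(next);
            pos = next;
        }
        return breaks;
    }

    public static void main(String[] args) {
        Set<String> words = new HashSet<>(Arrays.asList("the", "cat", "sat"));
        // A run of letters with no spaces, as the state table would leave it.
        System.out.println(divideRange("thecatsat", 0, 9, words)); // prints [3, 6, 9]
    }
}
```

Greedy matching can pick a wrong word where a shorter one was needed, which is one reason a production implementation may need to back up and retry; the sketch only shows where the dictionary enters the process.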
Nested Class Summary | |
protected class |
DictionaryBasedBreakIterator.Builder
The Builder class for DictionaryBasedBreakIterator inherits almost all of its functionality from the Builder class for RuleBasedBreakIterator, but extends it with extra logic to handle the " |
Nested classes inherited from class java.text.BreakIterator |
|
Field Summary | |
private int[] |
cachedBreakPositions
when a range of characters is divided up using the dictionary, the break positions that are discovered are stored here, preventing us from having to use either the dictionary or the state table again until the iterator leaves this range of text |
private boolean[] |
categoryFlags
a list of flags indicating which character categories are contained in the dictionary file (this is used to determine which ranges of characters to apply the dictionary to) |
private BreakDictionary |
dictionary
a list of known words that is used to divide up contiguous ranges of letters, stored in a compressed, indexed, format that offers fast access |
private int |
dictionaryCharCount
a temporary hiding place for the number of dictionary characters in the last range passed over by next() |
protected static byte |
IGNORE
A token used as a character-category value to identify ignore characters |
private int |
positionInCache
if cachedBreakPositions is not null, this indicates which item in the cache the current iteration position refers to |
Fields inherited from class java.text.BreakIterator |
DONE |
Constructor Summary | |
DictionaryBasedBreakIterator(String description,
InputStream dictionaryStream)
Constructs a DictionaryBasedBreakIterator. |
Method Summary | |
protected static void |
checkOffset(int offset,
CharacterIterator text)
Throw IllegalArgumentException unless begin <= offset < end. |
Object |
clone()
Clones this iterator. |
int |
current()
Returns the current iteration position. |
private void |
divideUpDictionaryRange(int startPos,
int endPos)
This is the function that actually implements the dictionary-based algorithm. |
boolean |
equals(Object that)
Returns true if both BreakIterators are of the same class, have the same rules, and iterate over the same text. |
int |
first()
Sets the current iteration position to the beginning of the text. |
int |
following(int offset)
Sets the current iteration position to the first boundary position after the specified position. |
CharacterIterator |
getText()
Return a CharacterIterator over the text being analyzed. |
protected int |
handleNext()
This is the implementation function for next(). |
protected int |
handlePrevious()
This method backs the iterator back up to a "safe position" in the text. |
int |
hashCode()
Computes a hash code for this BreakIterator. |
boolean |
isBoundary(int offset)
Returns true if the specified position is a boundary position. |
int |
last()
Sets the current iteration position to the end of the text. |
protected int |
lookupBackwardState(int state,
int category)
Given a current state and a character category, looks up the next state to transition to in the backwards state table. |
protected int |
lookupCategory(char c)
Looks up a character category for a character. |
protected int |
lookupState(int state,
int category)
Given a current state and a character category, looks up the next state to transition to in the state table. |
protected RuleBasedBreakIterator.Builder |
makeBuilder()
Returns a Builder that is customized to build a DictionaryBasedBreakIterator. |
int |
next()
Advances the iterator to the next boundary position. |
int |
next(int n)
Advances the iterator either forward or backward the specified number of steps. |
int |
preceding(int offset)
Sets the current iteration position to the last boundary position before the specified position. |
int |
previous()
Advances the iterator one step backwards. |
void |
setText(CharacterIterator newText)
Set the iterator to analyze a new piece of text. |
String |
toString()
Returns the description used to create this iterator. |
Methods inherited from class java.text.BreakIterator |
getAvailableLocales, getCharacterInstance, getCharacterInstance, getLineInstance, getLineInstance, getSentenceInstance, getSentenceInstance, getWordInstance, getWordInstance, setText |
Methods inherited from class java.lang.Object |
finalize, getClass, notify, notifyAll, wait, wait, wait |
Field Detail |
private BreakDictionary dictionary
private boolean[] categoryFlags
private int dictionaryCharCount
private int[] cachedBreakPositions
private int positionInCache
protected static final byte IGNORE
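The interplay of cachedBreakPositions and positionInCache, as described in the field summary, might look roughly like the sketch below. It only assumes what the field descriptions state: breaks found for one range are stored, and iteration is served from that store until it leaves the range. The class BreakCacheSketch, the methods fill and nextFromCache, and the MISS marker are hypothetical.

```java
import java.util.Arrays;

public class BreakCacheSketch {
    static final int MISS = -1; // hypothetical marker: caller must recompute breaks

    private int[] cachedBreakPositions; // null when nothing is cached
    private int positionInCache;        // index of the current position within the cache

    /** Stores the breaks found for one range; the first entry is the range start. */
    void fill(int[] positions) {
        cachedBreakPositions = Arrays.copyOf(positions, positions.length);
        positionInCache = 0;
    }

    /** Returns the next cached break, or MISS once iteration leaves the cached range. */
    int nextFromCache() {
        if (cachedBreakPositions != null
                && positionInCache < cachedBreakPositions.length - 1) {
            return cachedBreakPositions[++positionInCache];
        }
        cachedBreakPositions = null; // left the range: drop the cache
        return MISS;
    }

    public static void main(String[] args) {
        BreakCacheSketch cache = new BreakCacheSketch();
        cache.fill(new int[] {0, 3, 6, 9});
        System.out.println(cache.nextFromCache()); // prints 3
        System.out.println(cache.nextFromCache()); // prints 6
        System.out.println(cache.nextFromCache()); // prints 9
        System.out.println(cache.nextFromCache()); // prints -1
    }
}
```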
Constructor Detail |
public DictionaryBasedBreakIterator(String description, InputStream dictionaryStream) throws IOException
description
- Same as the description parameter on RuleBasedBreakIterator,
except for the special meaning of "
Method Detail |
protected RuleBasedBreakIterator.Builder makeBuilder()
makeBuilder
in class RuleBasedBreakIterator
public void setText(CharacterIterator newText)
RuleBasedBreakIterator
setText
in class RuleBasedBreakIterator
newText
- An iterator over the text to analyze.
public int first()
first
in class RuleBasedBreakIterator
public int last()
last
in class RuleBasedBreakIterator
public int previous()
previous
in class RuleBasedBreakIterator
public int preceding(int offset)
preceding
in class RuleBasedBreakIterator
offset
- The position to begin searching from
public int following(int offset)
following
in class RuleBasedBreakIterator
offset
- The position to begin searching forward from
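As a usage illustration for preceding(int), following(int), and isBoundary(int), the sketch below uses the standard public factory BreakIterator.getWordInstance rather than this class, on the assumption that the inherited contract is the same. The class name BoundaryLookupSketch and the helper wordsOf are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class BoundaryLookupSketch {
    /** Builds a standard word iterator over the given text (hypothetical helper). */
    static BreakIterator wordsOf(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        return it;
    }

    public static void main(String[] args) {
        // Word boundaries in "Hello world" fall at offsets 0, 5, 6, and 11.
        BreakIterator it = wordsOf("Hello world");
        System.out.println(it.following(2));   // prints 5: first boundary after offset 2
        System.out.println(it.preceding(8));   // prints 6: last boundary before offset 8
        System.out.println(it.isBoundary(5));  // prints true
        System.out.println(it.isBoundary(3));  // prints false
        // Searching past either end of the text yields BreakIterator.DONE.
        System.out.println(it.following(11) == BreakIterator.DONE); // prints true
    }
}
```

All three calls also move the iterator's current position, so a later call to current() reflects the most recent lookup.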
protected int handleNext()
handleNext
in class RuleBasedBreakIterator
protected int lookupCategory(char c)
lookupCategory
in class RuleBasedBreakIterator
private void divideUpDictionaryRange(int startPos, int endPos)
public Object clone()
clone
in class BreakIterator
public boolean equals(Object that)
equals
in class Object
that
- the reference object with which to compare.
true if this object is the same as the obj argument; false otherwise.
Object.hashCode(), Hashtable
public String toString()
toString
in class Object
public int hashCode()
hashCode
in class Object
Object.equals(java.lang.Object), Hashtable
public int next(int n)
next
in class BreakIterator
n
- The number of steps to move. The sign indicates the direction
(negative is backwards, and positive is forwards).
public int next()
next
in class BreakIterator
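The two next methods can be illustrated with a short sketch: next() in the usual boundary-walking loop, and next(int n) with positive and negative step counts. It uses the standard public BreakIterator.getWordInstance rather than this class, on the assumption that the inherited contract is the same; the class name NextSketch and the helper boundaries are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class NextSketch {
    /** Collects every boundary from first() until next() reports DONE. */
    static List<Integer> boundaries(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(boundaries("Hello world")); // prints [0, 5, 6, 11]

        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText("Hello world");
        it.first();
        System.out.println(it.next(2));  // prints 6: two steps forward from 0
        System.out.println(it.next(-1)); // prints 5: one step backward from 6
    }
}
```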
protected static final void checkOffset(int offset, CharacterIterator text)
public boolean isBoundary(int offset)
isBoundary
in class BreakIterator
offset
- the offset to check.
public int current()
current
in class BreakIterator
public CharacterIterator getText()
getText
in class BreakIterator
protected int handlePrevious()
protected int lookupState(int state, int category)
protected int lookupBackwardState(int state, int category)