DictionaryBasedBreakIterator

java.text
Class DictionaryBasedBreakIterator

java.lang.Object
  java.text.BreakIterator
      java.text.RuleBasedBreakIterator
          java.text.DictionaryBasedBreakIterator

All Implemented Interfaces:: Cloneable

class DictionaryBasedBreakIterator
extends RuleBasedBreakIterator

A subclass of RuleBasedBreakIterator that adds the ability to use a dictionary to further subdivide ranges of text beyond what is possible using just the state-table-based algorithm. This is necessary, for example, to handle word and line breaking in Thai, which doesn't use spaces between words. The state-table-based algorithm used by RuleBasedBreakIterator is used to divide up text as far as possible, and then contiguous ranges of letters are repeatedly compared against a list of known words (i.e., the dictionary) to divide them up into words.

DictionaryBasedBreakIterator uses the same rule language as RuleBasedBreakIterator, but adds one more special substitution name: <dictionary>. This substitution name is used to identify characters in words in the dictionary. The idea is that if the iterator passes over a chunk of text that includes two or more characters in a row that are included in <dictionary>, it goes back through that range and derives additional break positions (if possible) using the dictionary.

DictionaryBasedBreakIterator is also constructed with the filename of a dictionary file. It follows a prescribed search path to locate the dictionary (right now, it looks for it in /com/ibm/text/resources in each directory in the classpath, and won't find it in JAR files, but this location is likely to change). The dictionary file is in a serialized binary format. We have a very primitive (and slow) BuildDictionaryFile utility for creating dictionary files, but aren't currently making it public. Contact us for help.
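As a rough illustration of where dictionary-based breaking matters, the sketch below walks the word boundaries of a short Thai string through the public BreakIterator factory. It does not construct DictionaryBasedBreakIterator directly (the class is not public); whether a dictionary-backed iterator is actually used for the Thai locale depends on the runtime, so no particular break positions are assumed. The class and method names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of java.text.
final class ThaiWordBoundariesSketch {

    /** Collects every boundary offset that a word iterator for the locale reports. */
    static List<Integer> boundaries(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        // A Thai greeting written without spaces (10 UTF-16 code units).
        String thai = "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35\u0e04\u0e23\u0e31\u0e1a";
        // The offsets printed depend on the locale data available at run time.
        System.out.println(boundaries(thai, Locale.forLanguageTag("th")));
    }
}
```

Whatever the runtime does internally, the reported offsets always start at the beginning of the text, end at its length, and increase strictly in between.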

Nested Class Summary
`protected class`	`DictionaryBasedBreakIterator.Builder` The Builder class for DictionaryBasedBreakIterator inherits almost all of its functionality from the Builder class for RuleBasedBreakIterator, but extends it with extra logic to handle the "<dictionary>" token

Nested classes inherited from class java.text.BreakIterator

Field Summary
`private int[]`	`cachedBreakPositions` when a range of characters is divided up using the dictionary, the break positions that are discovered are stored here, preventing us from having to use either the dictionary or the state table again until the iterator leaves this range of text
`private boolean[]`	`categoryFlags` a list of flags indicating which character categories are contained in the dictionary file (this is used to determine which ranges of characters to apply the dictionary to)
`private BreakDictionary`	`dictionary` a list of known words that is used to divide up contiguous ranges of letters, stored in a compressed, indexed format that offers fast access
`private int`	`dictionaryCharCount` a temporary hiding place for the number of dictionary characters in the last range passed over by next()
`protected static byte`	`IGNORE` A token used as a character-category value to identify ignore characters
`private int`	`positionInCache` if cachedBreakPositions is not null, this indicates which item in the cache the current iteration position refers to

Fields inherited from class java.text.BreakIterator

DONE

Constructor Summary
`DictionaryBasedBreakIterator(String description, InputStream dictionaryStream)` Constructs a DictionaryBasedBreakIterator.

Method Summary
`protected static void`	`checkOffset(int offset, CharacterIterator text)` Throw IllegalArgumentException unless begin <= offset < end.
`Object`	`clone()` Clones this iterator.
`int`	`current()` Returns the current iteration position.
`private void`	`divideUpDictionaryRange(int startPos, int endPos)` This is the function that actually implements the dictionary-based algorithm.
`boolean`	`equals(Object that)` Returns true if both BreakIterators are of the same class, have the same rules, and iterate over the same text.
`int`	`first()` Sets the current iteration position to the beginning of the text.
`int`	`following(int offset)` Sets the current iteration position to the first boundary position after the specified position.
`CharacterIterator`	`getText()` Return a CharacterIterator over the text being analyzed.
`protected int`	`handleNext()` This is the implementation function for next().
`protected int`	`handlePrevious()` This method backs the iterator back up to a "safe position" in the text.
`int`	`hashCode()` Compute a hashcode for this BreakIterator
`boolean`	`isBoundary(int offset)` Returns true if the specified position is a boundary position.
`int`	`last()` Sets the current iteration position to the end of the text.
`protected int`	`lookupBackwardState(int state, int category)` Given a current state and a character category, looks up the next state to transition to in the backwards state table.
`protected int`	`lookupCategory(char c)` Looks up a character category for a character.
`protected int`	`lookupState(int state, int category)` Given a current state and a character category, looks up the next state to transition to in the state table.
`protected RuleBasedBreakIterator.Builder`	`makeBuilder()` Returns a Builder that is customized to build a DictionaryBasedBreakIterator.
`int`	`next()` Advances the iterator to the next boundary position.
`int`	`next(int n)` Advances the iterator either forward or backward the specified number of steps.
`int`	`preceding(int offset)` Sets the current iteration position to the last boundary position before the specified position.
`int`	`previous()` Advances the iterator one step backwards.
`void`	`setText(CharacterIterator newText)` Set the iterator to analyze a new piece of text.
`String`	`toString()` Returns the description used to create this iterator

Methods inherited from class java.text.BreakIterator

getAvailableLocales, getCharacterInstance, getCharacterInstance, getLineInstance, getLineInstance, getSentenceInstance, getSentenceInstance, getWordInstance, getWordInstance, setText

Methods inherited from class java.lang.Object

finalize, getClass, notify, notifyAll, wait, wait, wait

Field Detail

dictionary

private BreakDictionary dictionary

a list of known words that is used to divide up contiguous ranges of letters, stored in a compressed, indexed format that offers fast access

categoryFlags

private boolean[] categoryFlags

a list of flags indicating which character categories are contained in the dictionary file (this is used to determine which ranges of characters to apply the dictionary to)

dictionaryCharCount

private int dictionaryCharCount

a temporary hiding place for the number of dictionary characters in the last range passed over by next()

cachedBreakPositions

private int[] cachedBreakPositions

when a range of characters is divided up using the dictionary, the break positions that are discovered are stored here, preventing us from having to use either the dictionary or the state table again until the iterator leaves this range of text

positionInCache

private int positionInCache

if cachedBreakPositions is not null, this indicates which item in the cache the current iteration position refers to
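The interplay between cachedBreakPositions and positionInCache can be pictured with the small sketch below. It is an assumption-laden stand-in, not the class's actual code: the class name, the fill() method, and the use of a DONE return value to signal "left the cached range, fall back to the state table" are all hypothetical.

```java
// Hypothetical model of a break-position cache; not the real implementation.
final class BreakCacheSketch {
    static final int DONE = -1;          // stand-in for BreakIterator.DONE

    private int[] cachedBreakPositions;  // null when no range is cached
    private int positionInCache;         // index of the current position in the cache

    /** Stores the boundaries found for one dictionary range; iteration starts at the first. */
    void fill(int[] positions) {
        cachedBreakPositions = positions.clone();
        positionInCache = 0;
    }

    boolean isCached() {
        return cachedBreakPositions != null;
    }

    /** Steps forward inside the cache, or drops the cache and returns DONE at its end. */
    int next() {
        if (cachedBreakPositions != null
                && positionInCache < cachedBreakPositions.length - 1) {
            return cachedBreakPositions[++positionInCache];
        }
        cachedBreakPositions = null;     // the iterator has left the cached range
        return DONE;
    }

    /** Steps backward inside the cache, or drops the cache and returns DONE at its start. */
    int previous() {
        if (cachedBreakPositions != null && positionInCache > 0) {
            return cachedBreakPositions[--positionInCache];
        }
        cachedBreakPositions = null;
        return DONE;
    }
}
```

While the iterator stays inside the cached range, moving forward or backward is a plain array lookup, which is the saving the field description refers to.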

IGNORE

protected static final byte IGNORE

A token used as a character-category value to identify ignore characters

See Also:: Constant Field Values

Constructor Detail

DictionaryBasedBreakIterator

public DictionaryBasedBreakIterator(String description,
                                    InputStream dictionaryStream)
                             throws IOException

Constructs a DictionaryBasedBreakIterator.
Parameters:: description - Same as the description parameter on RuleBasedBreakIterator, except for the special meaning of "<dictionary>". This parameter is just passed through to RuleBasedBreakIterator's constructor.

Method Detail

makeBuilder

protected RuleBasedBreakIterator.Builder makeBuilder()

Returns a Builder that is customized to build a DictionaryBasedBreakIterator. This is the same as RuleBasedBreakIterator.Builder, except for the extra code to handle the <dictionary> tag.

Overrides:: makeBuilder in class RuleBasedBreakIterator

setText

public void setText(CharacterIterator newText)

Description copied from class: RuleBasedBreakIterator

Set the iterator to analyze a new piece of text. This function resets the current iteration position to the beginning of the text.

Overrides:: setText in class RuleBasedBreakIterator

Parameters:: newText - An iterator over the text to analyze.
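A minimal sketch of the reset behavior, using a word iterator from the public BreakIterator factory rather than this class directly: the same iterator is reused for a second piece of text, and iteration starts from current() instead of first() to show that setText has already moved the position back to the start. The helper name is hypothetical.

```java
import java.text.BreakIterator;
import java.text.StringCharacterIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of java.text.
final class SetTextResetSketch {

    /** Points the iterator at new text and lists its boundaries from the reset position. */
    static List<Integer> allBoundaries(BreakIterator it, String text) {
        it.setText(new StringCharacterIterator(text));
        List<Integer> out = new ArrayList<>();
        // current() is the beginning of the text right after setText().
        for (int p = it.current(); p != BreakIterator.DONE; p = it.next()) {
            out.add(p);
        }
        return out;
    }

    public static void main(String[] args) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        System.out.println(allBoundaries(it, "Hello world")); // [0, 5, 6, 11]
        System.out.println(allBoundaries(it, "a b"));         // [0, 1, 2, 3]
    }
}
```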

first

public int first()

Sets the current iteration position to the beginning of the text (i.e., the CharacterIterator's starting offset).

Overrides:: first in class RuleBasedBreakIterator

Returns:: The offset of the beginning of the text.

last

public int last()

Sets the current iteration position to the end of the text (i.e., the CharacterIterator's ending offset).

Overrides:: last in class RuleBasedBreakIterator

Returns:: The text's past-the-end offset.

previous

public int previous()

Advances the iterator one step backwards.

Overrides:: previous in class RuleBasedBreakIterator

Returns:: The position of the last boundary before the current iteration position

preceding

public int preceding(int offset)

Sets the current iteration position to the last boundary position before the specified position.

Overrides:: preceding in class RuleBasedBreakIterator

Parameters:: offset - The position to begin searching from
Returns:: The position of the last boundary before "offset"

following

public int following(int offset)

Sets the current iteration position to the first boundary position after the specified position.

Overrides:: following in class RuleBasedBreakIterator

Parameters:: offset - The position to begin searching forward from
Returns:: The position of the first boundary after "offset"
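The difference between preceding and following is easiest to see on a concrete string. The sketch below uses an English word iterator from the public factory as a stand-in; both calls exclude the offset itself, so asking for the boundary after an offset that is already a boundary yields the next one. The class name is hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper, not part of java.text.
final class PrecedingFollowingSketch {

    private static BreakIterator words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        return it;
    }

    /** First boundary strictly after the offset. */
    static int after(String text, int offset) {
        return words(text).following(offset);
    }

    /** Last boundary strictly before the offset. */
    static int before(String text, int offset) {
        return words(text).preceding(offset);
    }

    public static void main(String[] args) {
        String text = "Hello world";            // word boundaries: 0, 5, 6, 11
        System.out.println(after(text, 2));     // 5
        System.out.println(after(text, 5));     // 6
        System.out.println(before(text, 8));    // 6
        System.out.println(before(text, 6));    // 5
    }
}
```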

handleNext

protected int handleNext()

This is the implementation function for next().

Overrides:: handleNext in class RuleBasedBreakIterator

lookupCategory

protected int lookupCategory(char c)

Looks up a character category for a character.

Overrides:: lookupCategory in class RuleBasedBreakIterator

divideUpDictionaryRange

private void divideUpDictionaryRange(int startPos,
                                     int endPos)

This is the function that actually implements the dictionary-based algorithm. Given the endpoints of a range of text, it uses the dictionary to determine the positions of any boundaries in this range. It stores all the boundary positions it discovers in cachedBreakPositions so that we only have to do this work once for each time we enter the range.
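To make the idea concrete, here is a simplified sketch of dividing a range with a word list. It is not the algorithm this class uses: the real method works against a compressed BreakDictionary, whereas this stand-in uses a plain Set of strings and a straightforward dynamic-programming pass, and it simply gives up (returning only the range endpoints) when the range cannot be covered by known words. All names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Set;

// Hypothetical, simplified stand-in for dictionary-based range division.
final class DictionaryRangeSketch {

    /**
     * Returns boundary offsets (including startPos and endPos) that split
     * text[startPos, endPos) into dictionary words, or just the two endpoints
     * if no complete segmentation exists.
     */
    static int[] divide(String text, int startPos, int endPos, Set<String> words) {
        int n = endPos - startPos;
        int[] prev = new int[n + 1];      // prev[j] = start of the word ending at j, -1 if unreachable
        Arrays.fill(prev, -1);
        prev[0] = 0;                      // the range start is always reachable
        for (int i = 0; i < n; i++) {
            if (prev[i] < 0) {
                continue;                 // no word sequence ends here
            }
            for (int j = i + 1; j <= n; j++) {
                if (prev[j] < 0 && words.contains(text.substring(startPos + i, startPos + j))) {
                    prev[j] = i;          // keep the first segmentation found
                }
            }
        }
        if (prev[n] < 0) {
            return new int[] {startPos, endPos};
        }
        Deque<Integer> stack = new ArrayDeque<>();
        for (int i = n; i > 0; i = prev[i]) {
            stack.push(startPos + i);     // walk back from the end of the range
        }
        stack.push(startPos);
        int[] out = new int[stack.size()];
        int k = 0;
        for (int p : stack) {
            out[k++] = p;                 // the deque iterates in ascending offset order
        }
        return out;
    }
}
```

The returned array plays the role that cachedBreakPositions plays in the description above: once computed, the boundaries for the range can be served without consulting the word list again.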

clone

public Object clone()

Clones this iterator.

Overrides:: clone in class BreakIterator

Returns:: A newly-constructed RuleBasedBreakIterator with the same behavior as this one.

equals

public boolean equals(Object that)

Returns true if both BreakIterators are of the same class, have the same rules, and iterate over the same text.

Overrides:: equals in class Object

Parameters:: that - the reference object with which to compare.
Returns:: true if this object is the same as the that argument; false otherwise.
See Also:: Object.hashCode(), Hashtable

toString

public String toString()

Returns the description used to create this iterator

Overrides:: toString in class Object

Returns:: a string representation of the object.

hashCode

public int hashCode()

Compute a hashcode for this BreakIterator

Overrides:: hashCode in class Object

Returns:: A hash code
See Also:: Object.equals(java.lang.Object), Hashtable

next

public int next(int n)

Advances the iterator either forward or backward the specified number of steps. Negative values move backward, and positive values move forward. This is equivalent to repeatedly calling next() or previous().

Specified by:: next in class BreakIterator

Parameters:: n - The number of steps to move. The sign indicates the direction (negative is backwards, and positive is forwards).
Returns:: The character offset of the boundary position n boundaries away from the current one.
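The stated equivalence between next(n) and repeated single steps can be checked with a short sketch. It uses an English word iterator from the public factory as a stand-in and only covers non-negative n; the class and method names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper, not part of java.text.
final class NextNSketch {

    private static BreakIterator atStart(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        it.first();
        return it;
    }

    /** Moves n boundaries forward from the start in one call. */
    static int viaNextN(String text, int n) {
        return atStart(text).next(n);
    }

    /** Moves n boundaries forward from the start one step at a time (n >= 0). */
    static int viaRepeatedNext(String text, int n) {
        BreakIterator it = atStart(text);
        int pos = it.current();
        for (int i = 0; i < n; i++) {
            pos = it.next();
        }
        return pos;
    }

    public static void main(String[] args) {
        String text = "Hello world";                       // word boundaries: 0, 5, 6, 11
        System.out.println(viaNextN(text, 2));             // 6
        System.out.println(viaRepeatedNext(text, 2));      // 6
    }
}
```

A negative argument walks backward in the same way, so calling next(-1) after reaching offset 6 in this text returns 5.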

next

public int next()

Advances the iterator to the next boundary position.

Specified by:: next in class BreakIterator

Returns:: The position of the first boundary after this one.

checkOffset

protected static final void checkOffset(int offset,
                                        CharacterIterator text)

Throw IllegalArgumentException unless begin <= offset < end.
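Read literally, the contract above amounts to a half-open range check against the CharacterIterator's bounds. The sketch below implements exactly that reading; it is an illustration of the documented condition, not a copy of the method's source, and the class name and exception message are hypothetical.

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

// Hypothetical re-statement of the documented check: begin <= offset < end.
final class CheckOffsetSketch {

    static void checkOffset(int offset, CharacterIterator text) {
        if (offset < text.getBeginIndex() || offset >= text.getEndIndex()) {
            throw new IllegalArgumentException("offset out of bounds: " + offset);
        }
    }

    public static void main(String[] args) {
        CharacterIterator text = new StringCharacterIterator("abc"); // begin = 0, end = 3
        checkOffset(0, text);   // accepted
        checkOffset(2, text);   // accepted
        try {
            checkOffset(3, text);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```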

isBoundary

public boolean isBoundary(int offset)

Returns true if the specified position is a boundary position. As a side effect, leaves the iterator pointing to the first boundary position at or after "offset".

Overrides:: isBoundary in class BreakIterator

Parameters:: offset - the offset to check.
Returns:: True if "offset" is a boundary position.
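A short sketch of probing several offsets, again using an English word iterator from the public factory as a stand-in for this class. Because each call repositions the iterator as a side effect, the helper makes no assumption about where the iterator sits between calls. The class and method names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper, not part of java.text.
final class IsBoundarySketch {

    /** Reports, for each offset, whether a word iterator treats it as a boundary. */
    static boolean[] probe(String text, int... offsets) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        boolean[] result = new boolean[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
            // Each call may move the iterator; only the return value is used here.
            result[i] = it.isBoundary(offsets[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Word boundaries in "Hello world" are 0, 5, 6 and 11.
        boolean[] r = probe("Hello world", 0, 3, 5, 6);
        System.out.println(java.util.Arrays.toString(r)); // [true, false, true, true]
    }
}
```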

current

public int current()

Returns the current iteration position.

Specified by:: current in class BreakIterator

Returns:: The current iteration position.

getText

public CharacterIterator getText()

Return a CharacterIterator over the text being analyzed. This version of this method returns the actual CharacterIterator we're using internally. Changing the state of this iterator can have undefined consequences. If you need to change it, clone it first.

Specified by:: getText in class BreakIterator

Returns:: An iterator over the text being analyzed.
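The advice to clone before changing the returned iterator might look like the sketch below. It reads a character at an arbitrary offset through a clone, leaving the break iterator's own position untouched. The example uses an English word iterator from the public factory as a stand-in; the class and method names are hypothetical.

```java
import java.text.BreakIterator;
import java.text.CharacterIterator;
import java.util.Locale;

// Hypothetical helper, not part of java.text.
final class GetTextCloneSketch {

    /** Reads the character at the offset via a clone, so the iterator's state is not disturbed. */
    static char peek(BreakIterator it, int offset) {
        CharacterIterator copy = (CharacterIterator) it.getText().clone();
        return copy.setIndex(offset);   // setIndex returns the character at the new index
    }

    public static void main(String[] args) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText("Hello world");
        it.following(2);                     // iterator now sits on boundary 5
        System.out.println(peek(it, 8));     // r
        System.out.println(it.current());    // 5
    }
}
```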

handlePrevious

protected int handlePrevious()

This method backs the iterator back up to a "safe position" in the text. This is a position that we know, without any context, must be a break position. The various calling methods then iterate forward from this safe position to the appropriate position to return. (For more information, see the description of buildBackwardsStateTable() in RuleBasedBreakIterator.Builder.)

lookupState

protected int lookupState(int state,
                          int category)

Given a current state and a character category, looks up the next state to transition to in the state table.

lookupBackwardState

protected int lookupBackwardState(int state,
                                  int category)

Given a current state and a character category, looks up the next state to transition to in the backwards state table.
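For readers unfamiliar with table-driven iteration, the sketch below shows one common way a (state, category) lookup can be organized: a flat, row-major array indexed by state * numberOfCategories + category. The table layout, the states, and the two character categories are assumptions made for illustration and are far simpler than the tables this class builds from its rule description.

```java
// Hypothetical toy state machine: a run of letters forms one unit,
// and any other single character forms a unit of its own.
final class StateTableSketch {
    static final int LETTER = 0, OTHER = 1, NUM_CATEGORIES = 2;
    static final int STOP = 0, START = 1, IN_WORD = 2, IN_OTHER = 3;

    // Row-major transition table: one row per state, one column per category.
    static final int[] TABLE = {
        /* STOP     */ STOP,    STOP,
        /* START    */ IN_WORD, IN_OTHER,
        /* IN_WORD  */ IN_WORD, STOP,
        /* IN_OTHER */ STOP,    STOP,
    };

    static int lookupCategory(char c) {
        return Character.isLetter(c) ? LETTER : OTHER;
    }

    static int lookupState(int state, int category) {
        return TABLE[state * NUM_CATEGORIES + category];
    }

    /** Runs the machine from start and returns the offset of the next boundary. */
    static int nextBoundary(String text, int start) {
        int state = START;
        int pos = start;
        while (pos < text.length()) {
            state = lookupState(state, lookupCategory(text.charAt(pos)));
            if (state == STOP) {
                break;              // the character at pos belongs to the next unit
            }
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        String text = "ab cd";
        System.out.println(nextBoundary(text, 0)); // 2
        System.out.println(nextBoundary(text, 2)); // 3
        System.out.println(nextBoundary(text, 3)); // 5
    }
}
```

A backward table would be consulted the same way, with the text scanned from the end toward the beginning.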
