java.lang.Object java.text.BreakIterator java.text.RuleBasedBreakIterator
A subclass of BreakIterator whose behavior is specified using a list of rules.
There are two kinds of rules, which are separated by semicolons: substitutions and regular expressions.
A substitution rule defines a name that can be used in place of an expression. It consists of a name, which is a string of characters contained in angle brackets, an equals sign, and an expression. (There can be no whitespace on either side of the equals sign.) To keep its syntactic meaning intact, the expression must be enclosed in parentheses or square brackets. A substitution is visible after its definition, and is filled in using simple textual substitution. Substitution definitions can contain other substitutions, as long as those substitutions have been defined first. Substitutions are generally used to make the regular expressions (which can get quite complex) shorter and easier to read. They typically define either character categories or commonly-used subexpressions.
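The textual-substitution step described above can be sketched in a few lines. This is only an illustration under stated assumptions: SubstitutionSketch and its expand method are hypothetical names, not part of java.text, and the sketch splits on every semicolon without honoring backslash escapes, which the real parser would have to handle.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of java.text: shows textual substitution only.
class SubstitutionSketch {

    /** Expands substitutions and returns the remaining regular-expression rules. */
    static List<String> expand(String description) {
        Map<String, String> definitions = new LinkedHashMap<>();
        List<String> expressions = new ArrayList<>();

        // Simplification: splits on every ';' and ignores escaped semicolons.
        for (String rule : description.split(";")) {
            if (rule.isEmpty()) {
                continue;
            }
            int marker = rule.indexOf(">=");
            if (rule.startsWith("<") && marker > 0) {
                // Substitution rule: <name>=expression (no whitespace around '=').
                String name = rule.substring(0, marker + 1);
                String body = substitute(rule.substring(marker + 2), definitions);
                definitions.put(name, body);
            } else {
                // Regular-expression rule: fill in every name defined so far.
                expressions.add(substitute(rule, definitions));
            }
        }
        return expressions;
    }

    /** Replaces each defined name with its expression, in definition order. */
    private static String substitute(String text, Map<String, String> definitions) {
        for (Map.Entry<String, String> entry : definitions.entrySet()) {
            text = text.replace(entry.getKey(), entry.getValue());
        }
        return text;
    }

    public static void main(String[] args) {
        String description =
            "<letter>=[a-z];<digit>=[0-9];<alnum>=(<letter>|<digit>);<alnum>*<digit>;";
        System.out.println(expand(description));
    }
}
```

Because each definition is expanded when it is stored, a name can only refer to substitutions defined before it, which matches the ordering requirement stated above.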
There is one special substitution. If the description defines a substitution called "<ignore>", the expression must be a [] expression, and the expression defines a set of characters (the "ignore characters") that will be transparent to the BreakIterator. A sequence of characters will break the same way it would if any ignore characters it contains are taken out. Break positions never occur before ignore characters.
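The two properties of ignore characters can be imitated from the outside, which may help make them concrete. The sketch below is an assumption-laden illustration, not how this class works internally: IgnoreSketch is a hypothetical name, the word iterator from BreakIterator.getWordInstance stands in for a rule set, and the ignore set is passed as a plain string. It strips the ignore characters, finds boundaries in the stripped text, and maps each boundary back so that it lands after any ignore characters that follow it.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of the "ignore characters" contract.
class IgnoreSketch {

    static List<Integer> boundaries(String text, String ignoreChars) {
        // Build the text as it would look with ignore characters taken out,
        // remembering where each surviving character came from.
        StringBuilder stripped = new StringBuilder();
        List<Integer> originalIndex = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (ignoreChars.indexOf(text.charAt(i)) < 0) {
                stripped.append(text.charAt(i));
                originalIndex.add(i);
            }
        }

        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(stripped.toString());

        List<Integer> result = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            if (b == 0) {
                result.add(0); // the start of the text is always reported
                continue;
            }
            // Position just after the last character before the boundary...
            int offset = originalIndex.get(b - 1) + 1;
            // ...then move past ignore characters so no break falls before one.
            while (offset < text.length()
                    && ignoreChars.indexOf(text.charAt(offset)) >= 0) {
                offset++;
            }
            result.add(offset);
        }
        return result;
    }

    public static void main(String[] args) {
        // U+00AD (soft hyphen) is treated as the only ignore character here.
        System.out.println(boundaries("co\u00AD op", "\u00AD"));
    }
}
```

Edge cases such as text consisting only of ignore characters are deliberately left out of this sketch.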
A regular expression uses a subset of the normal Unix regular-expression syntax, and defines a sequence of characters to be kept together. With one significant exception, the iterator uses a longest-possible-match algorithm when matching text to regular expressions. The iterator also treats descriptions containing multiple regular expressions as if they were ORed together (i.e., as if they were separated by |).
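As a rough model of "longest possible match over ORed rules", the sketch below tries every rule at the current position and advances to the farthest match end. It is an assumption, not the implementation: this class is driven by state tables, whereas the sketch uses java.util.regex as a stand-in, and LongestMatchSketch is a hypothetical name. Advancing by one character when nothing matches is also an assumption made to keep the loop terminating.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical model: each Pattern plays the role of one regular-expression rule.
class LongestMatchSketch {

    static List<Integer> boundaries(String text, List<Pattern> rules) {
        List<Integer> result = new ArrayList<>();
        int pos = 0;
        result.add(pos);
        while (pos < text.length()) {
            // Assumption: if no rule matches, step over a single character.
            int best = pos + 1;
            for (Pattern rule : rules) {
                Matcher m = rule.matcher(text).region(pos, text.length());
                // lookingAt() anchors the match at the start of the region.
                if (m.lookingAt() && m.end() > best) {
                    best = m.end();
                }
            }
            pos = best; // the longest match across all rules wins
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Pattern> rules = Arrays.asList(
            Pattern.compile("[a-z]+"),           // a run of letters
            Pattern.compile("[a-z]+'[a-z]+"),    // letters with an inner apostrophe
            Pattern.compile(" +"));              // a run of spaces
        System.out.println(boundaries("don't stop", rules));
    }
}
```

At offset 0 both letter rules match, and the apostrophe rule is chosen because its match is longer, so "don't" stays together.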
The special characters recognized by the regular-expression parser are as follows:
* Specifies that the expression preceding the asterisk may occur any number of times (including not at all).
{} Encloses a sequence of characters that is optional.
() Encloses a sequence of characters. If followed by *, the sequence repeats. Otherwise, the parentheses are just a grouping device and a way to delimit the ends of expressions containing |.
| Separates two alternative sequences of characters. Either one sequence or the other, but not both, matches this expression. The | character can only occur inside ().
. Matches any character.
*? Specifies a non-greedy asterisk. *? works the same way as *, except when there is overlap between the last group of characters in the expression preceding the * and the first group of characters following the *. When there is this kind of overlap, * will match the longest sequence of characters that match the expression before the *, and *? will match the shortest sequence of characters matching the expression before the *?. For example, if you have "xxyxyyyxyxyxxyxyxyy" in the text, "x[xy]*x" will match through to the last x (i.e., "xxyxyyyxyxyxxyxyx", leaving only the final "yy" unmatched), but "x[xy]*?x" will only match the first two xes ("xx").
[] Specifies a group of alternative characters. A [] expression will match any single character that is specified in the [] expression. For more on the syntax of [] expressions, see below.
/ Specifies where the break position should go if text matches this expression. For example, "[a-z]*/[:Zs:]*[0-9]" will match if the iterator sees a run of letters, followed by a run of whitespace, followed by a digit, but the break position will actually go before the whitespace. Expressions that don't contain / put the break position at the end of the matching text.
\ Escape character. The \ itself is ignored, but causes the next character to be treated as a literal character. This has no effect for many characters, but for the characters listed above, this deprives them of their special meaning. (There are no special escape sequences for Unicode characters, or tabs and newlines; these are all handled by a higher-level protocol. In a Java string, "\n" will be converted to a literal newline character by the time the regular-expression parser sees it. Of course, this means that \ sequences that are visible to the regexp parser must be written as \\ when inside a Java string.) All characters in the ASCII range except for letters, digits, and control characters are reserved characters to the parser and must be preceded by \ even if they currently don't mean anything.
! If ! appears at the beginning of a regular expression, it tells the regexp parser that this expression specifies the backwards-iteration behavior of the iterator, and not its normal iteration behavior. This is generally only used in situations where the automatically-generated backwards-iteration behavior doesn't produce satisfactory results and must be supplemented with extra client-specified rules.
(all others) All other characters are treated as literal characters, which must match the corresponding character(s) in the text exactly.
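The difference between * and *?, and the effect of /, can be tried out with java.util.regex, whose greedy and reluctant quantifiers behave the same way on this particular input. This is a sketch by analogy only: java.util.regex is a different engine with a different syntax, the / marker is approximated with a lookahead, the digit class is written as [0-9], and GreedySketch is a hypothetical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo using java.util.regex as a stand-in for the rule syntax.
class GreedySketch {

    /** Returns the text matched at the very start of the input, or null. */
    static String matchAtStart(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.lookingAt() ? m.group() : null;
    }

    /**
     * Approximates the rule "[a-z]*" + "/" + "[:Zs:]*[0-9]": the letters are
     * consumed, while the whitespace and digit are only required to follow.
     * Returns the break offset, or -1 if the rule does not match at offset 0.
     */
    static int breakBeforeWhitespace(String text) {
        // Note the doubled backslash: the regexp parser must see \p, so the
        // Java string literal has to spell it \\p.
        Matcher m = Pattern.compile("[a-z]*(?=\\p{Zs}*[0-9])").matcher(text);
        return m.lookingAt() ? m.end() : -1;
    }

    public static void main(String[] args) {
        String text = "xxyxyyyxyxyxxyxyxyy";
        System.out.println(matchAtStart("x[xy]*x", text));   // xxyxyyyxyxyxxyxyx
        System.out.println(matchAtStart("x[xy]*?x", text));  // xx
        System.out.println(breakBeforeWhitespace("abc  7"));
    }
}
```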
Within a [] expression, a number of other special characters can be used to specify groups of characters:
- Specifies a range of matching characters. For example "[a-p]" matches all lowercase Latin letters from a to p (inclusive). The - sign specifies ranges of continuous Unicode numeric values, not ranges of characters in a language's alphabetical order: "[a-z]" doesn't include capital letters, nor does it include accented letters such as a-umlaut.
:: A pair of colons containing a one- or two-letter code matches all characters in the corresponding Unicode category. The two-letter codes are the same as the two-letter codes in the Unicode database (for example, "[:Sc::Sm:]" matches all currency symbols and all math symbols). Specifying a one-letter code is the same as specifying all two-letter codes that begin with that letter (for example, "[:L:]" matches all letters, and is equivalent to "[:Lu::Ll::Lo::Lm::Lt:]"). Anything other than a valid two-letter Unicode category code or a single letter that begins a Unicode category code is illegal within colons.
[] [] expressions can nest. This has no effect, except when used in conjunction with the ^ token.
^ Excludes the character (or the characters in the [] expression) following it from the group of characters. For example, "[a-z^p]" matches all Latin lowercase letters except p. "[:L:^[\u4e00-\u9fff]]" matches all letters except the Han ideographs.
(all others) All other characters are treated as literal characters. (For example, "[aeiou]" specifies just the letters a, e, i, o, and u.)
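For readers more familiar with java.util.regex, the set notation above has close counterparts there, and checking a few characters against them can clarify what each form includes. The mapping below is an assumption made for illustration (the two syntaxes are not interchangeable): a category in colons is written \p{..}, and the ^ exclusion is written as an intersection with a negated class. CharSetSketch is a hypothetical name.

```java
import java.util.regex.Pattern;

// Hypothetical demo: java.util.regex character classes used as analogues
// of the [] expressions described above.
class CharSetSketch {

    /** Tests whether a single code point belongs to a java.util.regex class. */
    static boolean inSet(String javaCharClass, int codePoint) {
        String single = new String(Character.toChars(codePoint));
        return Pattern.matches(javaCharClass, single);
    }

    public static void main(String[] args) {
        // Ranges follow Unicode numeric values, so accented letters fall outside a-z.
        System.out.println(inSet("[a-p]", 'p'));
        System.out.println(inSet("[a-z]", 0x00e4));   // a-umlaut

        // Analogue of the currency-plus-math category set.
        System.out.println(inSet("[\\p{Sc}\\p{Sm}]", '$'));
        System.out.println(inSet("[\\p{Sc}\\p{Sm}]", '+'));

        // Analogue of "all lowercase Latin letters except p".
        System.out.println(inSet("[a-z&&[^p]]", 'p'));

        // Analogue of "all letters except the Han ideographs".
        System.out.println(inSet("[\\p{L}&&[^\\u4e00-\\u9fff]]", 0x4e2d));

        // Character.getType exposes the same Unicode general categories.
        System.out.println(Character.getType('$') == Character.CURRENCY_SYMBOL);
    }
}
```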
For a more complete explanation, see http://www.ibm.com/java/education/boundaries/boundaries.html. For examples, see the resource data (which is annotated).
Nested Class Summary
protected class | RuleBasedBreakIterator.Builder | The Builder class has the job of constructing a RuleBasedBreakIterator from a textual description.
private static class | RuleBasedBreakIterator.SafeCharIterator |
Nested classes inherited from class java.text.BreakIterator
Field Summary
private short[] | backwardsStateTable | The table of state transitions used to sync up the iterator with the text in backwards and random-access iteration
private sun.text.CompactByteArray | charCategoryTable | A table that indexes from character values to character category numbers
private String | description | The textual description this iterator was created from
private boolean[] | endStates | A list of flags indicating which states in the state table are accepting ("end") states
protected static byte | IGNORE | A token used as a character-category value to identify ignore characters
private boolean[] | lookaheadStates | A list of flags indicating which states in the state table are lookahead states (states which turn lookahead on and off)
private int | numCategories | The number of character categories (and, thus, the number of columns in the state tables)
private static short | START_STATE | The state number of the starting state
private short[] | stateTable | The table of state transitions used for forward iteration
private static short | STOP_STATE | The state-transition value indicating "stop"
private CharacterIterator | text | The character iterator through which this BreakIterator accesses the text
Fields inherited from class java.text.BreakIterator
DONE
Constructor Summary
RuleBasedBreakIterator(String description) | Constructs a RuleBasedBreakIterator according to the description provided.
Method Summary
protected static void | checkOffset(int offset, CharacterIterator text) | Throw IllegalArgumentException unless begin <= offset < end.
Object | clone() | Clones this iterator.
int | current() | Returns the current iteration position.
boolean | equals(Object that) | Returns true if both BreakIterators are of the same class, have the same rules, and iterate over the same text.
int | first() | Sets the current iteration position to the beginning of the text.
int | following(int offset) | Sets the iterator to refer to the first boundary position following the specified position.
CharacterIterator | getText() | Return a CharacterIterator over the text being analyzed.
protected int | handleNext() | This method is the actual implementation of the next() method.
protected int | handlePrevious() | This method backs the iterator back up to a "safe position" in the text.
int | hashCode() | Compute a hashcode for this BreakIterator.
boolean | isBoundary(int offset) | Returns true if the specified position is a boundary position.
int | last() | Sets the current iteration position to the end of the text.
protected int | lookupBackwardState(int state, int category) | Given a current state and a character category, looks up the next state to transition to in the backwards state table.
protected int | lookupCategory(char c) | Looks up a character's category (i.e., its category for breaking purposes, not its Unicode category).
protected int | lookupState(int state, int category) | Given a current state and a character category, looks up the next state to transition to in the state table.
protected RuleBasedBreakIterator.Builder | makeBuilder() | Creates a Builder.
int | next() | Advances the iterator to the next boundary position.
int | next(int n) | Advances the iterator either forward or backward the specified number of steps.
int | preceding(int offset) | Sets the iterator to refer to the last boundary position before the specified position.
int | previous() | Advances the iterator backwards, to the last boundary preceding this one.
void | setText(CharacterIterator newText) | Set the iterator to analyze a new piece of text.
String | toString() | Returns the description used to create this iterator.
Methods inherited from class java.text.BreakIterator
getAvailableLocales, getCharacterInstance, getCharacterInstance, getLineInstance, getLineInstance, getSentenceInstance, getSentenceInstance, getWordInstance, getWordInstance, setText
Methods inherited from class java.lang.Object
finalize, getClass, notify, notifyAll, wait, wait, wait
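The public methods summarized above are the ones declared by BreakIterator, so their use can be shown with any BreakIterator. The following is a minimal sketch that assumes an iterator obtained from the inherited getWordInstance factory rather than one constructed from a rule description; BoundaryWalk is a hypothetical name.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo of the iteration methods listed in the summary.
class BoundaryWalk {

    /** Collects every boundary from first() until next() returns DONE. */
    static List<Integer> wordBoundaries(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        List<Integer> out = new ArrayList<>();
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            out.add(pos);
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Hello world";
        System.out.println(wordBoundaries(text));

        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        System.out.println(it.following(2));   // first boundary after offset 2
        System.out.println(it.preceding(5));   // last boundary before offset 5
        System.out.println(it.isBoundary(5));  // is offset 5 itself a boundary?

        // Walk backwards from the end of the text.
        for (int pos = it.last(); pos != BreakIterator.DONE; pos = it.previous()) {
            System.out.println(text.substring(0, pos));
        }
    }
}
```

Offsets passed to following, preceding, and isBoundary have to lie within the text, otherwise an IllegalArgumentException is thrown, as the method details note.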
Field Detail
protected static final byte IGNORE
private static final short START_STATE
private static final short STOP_STATE
private String description
private sun.text.CompactByteArray charCategoryTable
private short[] stateTable
private short[] backwardsStateTable
private boolean[] endStates
private boolean[] lookaheadStates
private int numCategories
private CharacterIterator text
Constructor Detail
public RuleBasedBreakIterator(String description)
Method Detail
protected RuleBasedBreakIterator.Builder makeBuilder()

public Object clone()
clone in class BreakIterator

public boolean equals(Object that)
equals in class Object
Parameters: that - the reference object with which to compare.
Returns: true if this object is the same as the obj argument; false otherwise.
See Also: Object.hashCode(), Hashtable

public String toString()
toString in class Object

public int hashCode()
hashCode in class Object
See Also: Object.equals(java.lang.Object), Hashtable

public int first()
first in class BreakIterator

public int last()
last in class BreakIterator

public int next(int n)
next in class BreakIterator
Parameters: n - The number of steps to move. The sign indicates the direction (negative is backwards, and positive is forwards).

public int next()
next in class BreakIterator

public int previous()
previous in class BreakIterator

protected static final void checkOffset(int offset, CharacterIterator text)

public int following(int offset)
following in class BreakIterator
Parameters: offset - the offset to begin scanning. Valid values are determined by the CharacterIterator passed to setText(). Invalid values cause an IllegalArgumentException to be thrown.

public int preceding(int offset)
preceding in class BreakIterator
Parameters: offset - the offset to begin scanning. Valid values are determined by the CharacterIterator passed to setText(). Invalid values cause an IllegalArgumentException to be thrown.

public boolean isBoundary(int offset)
isBoundary in class BreakIterator
Parameters: offset - the offset to check.

public int current()
current in class BreakIterator

public CharacterIterator getText()
getText in class BreakIterator

public void setText(CharacterIterator newText)
setText in class BreakIterator
Parameters: newText - An iterator over the text to analyze.

protected int handleNext()

protected int handlePrevious()

protected int lookupCategory(char c)

protected int lookupState(int state, int category)

protected int lookupBackwardState(int state, int category)