javatools.parsers
Class Char

java.lang.Object
  extended by javatools.parsers.Char

public class Char
extends java.lang.Object

This class is part of the Java Tools (see http://mpii.de/yago-naga/javatools). It is licensed under the Creative Commons Attribution License (see http://creativecommons.org/licenses/by/3.0) by the YAGO-NAGA team (see http://mpii.de/yago-naga). This class provides static methods to decode, encode and normalize Strings.
Decoding converts HTML ampersand codes, backslash codes, percentage codes and UTF8 codes to Java 16-bit characters (char).

Encoding is the inverse operation. It takes a Java 16-bit character (char) and outputs its encoding in HTML, as a backslash code, as a percentage code or in UTF8.

Normalization converts Unicode characters (Java 16-bit chars) to ASCII characters in the range 0x20-0x7F.

Usage

Decoding is done by methods that "eat" a code from the string. They require as an additional parameter an integer array of length 1, in which they store the length of the code that they chopped off.
Example:

     int[] eatLength=new int[1];
     char c=eatPercentage("%2Cblah blah",eatLength);
     -->  c=','
          eatLength[0]=3  // the code was 3 characters long
  
There is a static integer array Char.eatLength, which you can use for this purpose. The methods store 0 in case the String does not start with the correct code. They store -1 in case the String starts with a corrupted code. Of course, you can use the eat... methods also to decode one single code. There are methods decode... that decode the percentage code, the UTF8-codes, the backslash codes or the Ampersand codes, respectively. The method decode(String) decodes all codes of a String.
Example:
     decode("This String contains some codes: &amp; %2C \\u0041");
     --> "This String contains some codes: & , A"
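
The "eat" convention described above can be sketched in a few lines. The class below is a hypothetical stand-alone illustration, not the javatools implementation: the names EatPercentageSketch and decodeAll are assumptions, and the sketch handles plain "%xx" codes only, without the UTF8 completion that eatPercentage performs.

```java
// Hypothetical stand-alone sketch; not the javatools implementation.
final class EatPercentageSketch {

    /**
     * Tries to read one "%xx" code at the start of s.
     * Stores in n[0] the number of characters consumed:
     * 0 if s does not start with '%', -1 if the code is corrupted.
     */
    static char eatPercentage(String s, int[] n) {
        if (s.isEmpty() || s.charAt(0) != '%') {
            n[0] = 0;
            return 0;
        }
        if (s.length() < 3) {
            n[0] = -1;
            return 0;
        }
        int hi = Character.digit(s.charAt(1), 16);
        int lo = Character.digit(s.charAt(2), 16);
        if (hi < 0 || lo < 0) {
            n[0] = -1;
            return 0;
        }
        n[0] = 3;
        return (char) (hi * 16 + lo);
    }

    /** Decodes every well-formed "%xx" code and copies everything else unchanged. */
    static String decodeAll(String s) {
        StringBuilder out = new StringBuilder();
        int[] n = new int[1];
        int i = 0;
        while (i < s.length()) {
            char c = eatPercentage(s.substring(i), n);
            if (n[0] > 0) {
                out.append(c);
                i += n[0];
            } else {
                out.append(s.charAt(i));
                i++;
            }
        }
        return out.toString();
    }
}
```

With this sketch, decodeAll("a%2Cb%zz") yields "a,b%zz": the well-formed code is replaced and the corrupted one is copied through unchanged.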
  

Normalization is done by the method normalize(int c). It converts a Unicode character (a 16-bit Java character char) to a sequence of normal characters (i.e. characters in the range 0x20-0x7F). The transliteration may consist of multiple chars (e.g. for umlauts) and also of no chars at all (e.g. for Unicode Zero-Space-Characters).
Example:

    normalize('ä');
    --> "ae"
  
The method normalize(String) normalizes all characters in a String.
Example:
     normalize("This String contains the umlauts Ä, Ö and Ü");
     -->  "This String contains the umlauts Ae, Oe and Ue"
  
If the method cannot find a normalization, it calls defaultNormalizer.apply(char c). Decoding and normalizing can be combined by the method decodeAndNormalize(String s).
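
A table lookup with a replaceable fallback is one way to picture this behaviour. The sketch below is a hypothetical illustration, not the javatools code: the class name, the tiny TABLE and the use of java.util.function.IntFunction as the fallback type are all assumptions. The real class uses normalizeMap and Char.Char2StringFn.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical stand-alone sketch; not the javatools implementation.
final class NormalizeSketch {

    /** A deliberately tiny normalization table (assumed entries). */
    static final Map<Character, String> TABLE = new HashMap<>();
    static {
        TABLE.put('\u00E4', "ae"); // a with diaeresis
        TABLE.put('\u00C4', "Ae"); // A with diaeresis
        TABLE.put('\u00D6', "Oe"); // O with diaeresis
        TABLE.put('\u00DC', "Ue"); // U with diaeresis
        TABLE.put('\u200B', "");   // zero-width space maps to nothing
    }

    /** Replaceable fallback, called when no normalization is known. */
    static IntFunction<String> fallback = c -> "[?]";

    static String normalize(int c) {
        if (c < 0) return null;                                  // end-of-file marker
        if (c >= 0x20 && c <= 0x7F) return String.valueOf((char) c);
        String mapped = c <= 0xFFFF ? TABLE.get((char) c) : null;
        return mapped != null ? mapped : fallback.apply(c);
    }

    static String normalize(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            out.append(normalize(s.charAt(i)));
        }
        return out.toString();
    }
}
```

Assigning a different function to fallback changes what unknown characters turn into, which mirrors the role of defaultNormalizer.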

Encoding is done by methods called encode...(char). These methods take a character and transform it to a UTF8 code, a percentage code, an ampersand code or a backslash code, respectively. If the character is normal (i.e. in the range 0x20-0x7F), they simply return the input character without any change.
Example:

     encodePercentage('Ä');
     -->  "%C4"
  
There are also methods that work on entire Strings.
Example:
     encodePercentage("This String contains the umlauts Ä, Ö and Ü");
     -->  "This String contains the umlauts %C4, %D6 and %DC"
  

Last, this class provides the character categorization for URIs, as given in RFC 3986 (http://tools.ietf.org/html/rfc3986). It also provides a method to encode only those characters that are not valid path component characters.
Example:

     isReserved(';');
     -->  true
     encodeURIPathComponent("a: b")
     -->  "a:%20b"
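
The character classes of RFC 3986 are small enough to write down directly. The sketch below is a hypothetical re-statement of the RFC grammar (unreserved, gen-delims, sub-delims, pchar), not the javatools code. The class name and encodePathComponent are assumptions, and the library's own methods may differ in detail.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone sketch of the RFC 3986 character classes.
final class UriCharsSketch {

    static boolean isUnreserved(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9') || "-._~".indexOf(c) >= 0;
    }

    static boolean isGenDelim(char c) { return ":/?#[]@".indexOf(c) >= 0; }

    static boolean isSubDelim(char c) { return "!$&'()*+,;=".indexOf(c) >= 0; }

    static boolean isReserved(char c) { return isGenDelim(c) || isSubDelim(c); }

    /** pchar without the pct-encoded alternative, which spans three characters. */
    static boolean isPchar(char c) {
        return isUnreserved(c) || isSubDelim(c) || c == ':' || c == '@';
    }

    /** Percent-encodes every UTF-8 byte that is not a path character. */
    static String encodePathComponent(String s) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            if (v < 0x80 && isPchar((char) v)) {
                out.append((char) v);
            } else {
                out.append(String.format("%%%02X", v));
            }
        }
        return out.toString();
    }
}
```

A space is neither reserved nor unreserved under these definitions, which is why isUnreserved is not simply the negation of isReserved.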
  


Nested Class Summary
static interface Char.Char2StringFn
          Defines just one function from an int to a String
static interface Char.Legal
          Used for encoding selected characters
 
Field Summary
static java.util.Map<java.lang.String,java.lang.Character> ampersandMap
          Maps HTML ampersand sequences to characters
static java.util.Map<java.lang.Character,java.lang.String> charToAmpersand
          Maps a special character to a HTML ampersand sequence
static java.util.Map<java.lang.Character,java.lang.String> charToBackslash
          Maps a special character to a backslash sequence
static Char.Char2StringFn defaultNormalizer
          Called by normalize(int) in case the character cannot be normalized.
static java.util.Map<java.lang.Character,java.lang.String> normalizeMap
          Maps characters to normalizations
static java.lang.String UNKNOWN
          String returned by the default implementation of defaultNormalizer, "[?]"
 
Constructor Summary
Char()
           
 
Method Summary
static java.lang.String capitalize(java.lang.String s)
          Capitalizes words and lowercases the rest
static java.lang.String cutLast(java.lang.String s)
          Returns the String without the last character
static java.lang.StringBuilder cutLast(java.lang.StringBuilder s)
          Cuts the last character
static java.lang.String decode(java.lang.String s)
          Replaces all codes in a String by the 16 bit Unicode characters
static java.lang.String decodeAmpersand_UNKNOWN(java.lang.String s)
          Fabian: This method cannot decode numeric hexadecimal ampersand codes.
static java.lang.String decodeAmpersand(java.lang.String s)
          Decodes all ampersand sequences in the string
static java.lang.String decodeAmpersand(java.lang.String s, PositionTracker posTracker)
           
static java.lang.String decodeAndNormalize(java.lang.String s)
          Decodes all codes in a String and normalizes all chars
static java.lang.String decodeBackslash(java.lang.String s)
          Decodes all backslash characters in the string
static java.lang.String decodePercentage(java.lang.String s)
          Decodes all percentage characters in the string
static java.lang.String decodeURIPathComponent(java.lang.String s)
          Decodes a URI path component
static java.lang.String decodeUTF8(java.lang.String s)
          Decodes all UTF8 characters in the string
static char eatAmpersand(java.lang.String a, int[] n)
          Eats an HTML ampersand code from a String
static char eatBackslash(java.lang.String a, int[] n)
          Eats a backslash sequence from a String
static char eatPercentage(java.lang.String a, int[] n)
          Eats a String of the form "%xx" from a string, where xx is a hexadecimal code.
static char eatUtf8(java.lang.String a, int[] n)
          Eats a UTF8 code from a String.
static java.lang.String encodeAmpersand(char c)
          Encodes a character to an HTML-Ampersand code (if necessary)
static java.lang.String encodeAmpersand(java.lang.String c)
          Replaces non-normal characters in a String by HTML Ampersand codes
static java.lang.String encodeAmpersandToAlphanumeric(char c)
          Encodes a character to an HTML-Ampersand code (if necessary)
static java.lang.String encodeAmpersandToAlphanumeric(java.lang.String c)
          Replaces non-normal characters in a String by HTML Ampersand codes
static java.lang.String encodeBackslash(char c)
          Encodes a character to a backslash code (if necessary)
static java.lang.String encodeBackslash(java.lang.CharSequence s, Char.Legal legal)
          Encodes with backslash all illegal characters
static java.lang.String encodeBackslash(java.lang.String c)
          Replaces non-normal characters in a String by Backslash codes
static java.lang.String encodeBackslashToAlphanumeric(char c)
          Encodes a character to a backslash code (if not alphanumeric)
static java.lang.String encodeBackslashToAlphanumeric(java.lang.String c)
          Replaces non-normal characters in a String by Backslash codes (if not alphanumeric)
static java.lang.String encodeBackslashToASCII(char c)
          Encodes a character to a backslash code (if not ASCII)
static java.lang.String encodeBackslashToASCII(java.lang.String c)
          Replaces non-normal characters in a String by Backslash codes (if not ASCII)
static java.lang.String encodeHex(java.lang.String s)
          Replaces special characters in the string by hex codes (cannot be undone)
static java.lang.String encodePercentage(char c)
          Encodes a character to a percentage code (if necessary).
static java.lang.String encodePercentage(java.lang.String c)
          Replaces non-normal characters in a String by Percentage codes.
static java.lang.String encodeURIPathComponent(char c)
          Encodes a char to percentage code, if it is not a path character in the sense of URIs
static java.lang.String encodeURIPathComponent(java.lang.String s)
          Encodes all chars of a String to percentage codes, if they are not path characters in the sense of URIs
static java.lang.String encodeURIPathComponentXML(java.lang.String s)
          Encodes a char to percentage code, if it is not a path character in the sense of XMLs
static java.lang.String encodeUTF8(int c)
          Encodes a character to UTF8 (if necessary)
static java.lang.String encodeUTF8(java.lang.String c)
          Encodes a String to UTF8
static java.lang.String encodeXmlAttribute(java.lang.String str)
          Encodes a String with reserved XML characters into a valid xml string for attributes.
static boolean endsWith(java.lang.CharSequence s, java.lang.String end)
          Returns true if the CharSequence ends with the string
static java.lang.String hexAll(java.lang.String s)
          Returns the chars of a String in hex
static boolean in(char c, char a, char b)
          Tells whether a char is in a range
static boolean in(char c, java.lang.String s)
          Tells whether a char is in a string
static boolean isAlphanumeric(char c)
          Tells whether a char is alphanumeric in the sense of URIs
static boolean isEscaped(java.lang.String s)
          Tells whether a string is escaped in the sense of URIs
static boolean isGenDelim(char c)
          Tells whether a char is a general delimiter in the sense of URIs
static boolean isPchar(char c)
          Tells whether a char is a valid path component in the sense of URIs
static boolean isReserved(char c)
          Tells whether a char is reserved in the sense of URIs
static boolean isSubDelim(char c)
          Tells whether a char is a sub-delimiter in the sense of URIs
static boolean isUnreserved(char c)
          Tells whether a char is unreserved in the sense of URIs (not the same as !reserved)
static char last(java.lang.CharSequence s)
          Returns the last character of a String or 0
static java.lang.String lowCaseFirst(java.lang.String s)
          Lowcases the first character in a String
static void main(java.lang.String[] argv)
          Test routine
static java.lang.String normalize(int c)
          Normalizes a character to a String of characters in the range 0x20-0x7F.
static java.lang.String normalize(java.lang.String s)
          Normalizes all chars in a String to characters 0x20-0x7F
static java.lang.String toHTML(java.lang.String s)
          Returns an HTML-String of the String
static java.lang.CharSequence truncate(java.lang.CharSequence s, int len)
          Returns a string of the given length, fills with spaces if necessary
static java.lang.String upCaseFirst(java.lang.String s)
          Upcases the first character in a String
static int Utf8Length(char c)
          Tells from the first UTF-8 code character how long the code is.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

defaultNormalizer

public static Char.Char2StringFn defaultNormalizer
Called by normalize(int) in case the character cannot be normalized. The default implementation returns UNKNOWN. Feel free to create a new Char2StringFn and assign it to defaultNormalizer.


UNKNOWN

public static java.lang.String UNKNOWN
String returned by the default implementation of defaultNormalizer, "[?]"


charToAmpersand

public static java.util.Map<java.lang.Character,java.lang.String> charToAmpersand
Maps a special character to a HTML ampersand sequence


charToBackslash

public static java.util.Map<java.lang.Character,java.lang.String> charToBackslash
Maps a special character to a backslash sequence


ampersandMap

public static java.util.Map<java.lang.String,java.lang.Character> ampersandMap
Maps HTML ampersand sequences to characters


normalizeMap

public static java.util.Map<java.lang.Character,java.lang.String> normalizeMap
Maps characters to normalizations

Constructor Detail

Char

public Char()
Method Detail

normalize

public static java.lang.String normalize(int c)
Normalizes a character to a String of characters in the range 0x20-0x7F. Returns a String, because some characters are normalized to multiple characters (e.g. umlauts) and some characters are normalized to zero characters (e.g. special Unicode space chars). Returns null for the end-of-file character -1.


eatPercentage

public static char eatPercentage(java.lang.String a,
                                 int[] n)
Eats a String of the form "%xx" from a string, where xx is a hexadecimal code. If xx is a UTF8 code start, tries to complete the UTF8-code and decodes it.


eatAmpersand

public static char eatAmpersand(java.lang.String a,
                                int[] n)
Eats an HTML ampersand code from a String
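
As an illustration of what eating an ampersand code could involve, the hypothetical sketch below handles a handful of named entities and decimal numeric codes. The class name, the four-entry NAMED table and the choice to report a bare '&' as corrupted are assumptions. The real method relies on ampersandMap and may behave differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; not the javatools implementation.
final class EatAmpersandSketch {

    static final Map<String, Character> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", '&');
        NAMED.put("lt", '<');
        NAMED.put("gt", '>');
        NAMED.put("quot", '"');
    }

    /** Stores the consumed length in n[0]; 0 = no code here, -1 = corrupted code. */
    static char eatAmpersand(String s, int[] n) {
        n[0] = 0;
        if (s.isEmpty() || s.charAt(0) != '&') return 0;
        n[0] = -1;                                   // assume corrupted until proven valid
        int end = s.indexOf(';');
        if (end < 2) return 0;                       // no terminator or empty body
        String body = s.substring(1, end);
        if (body.matches("#[0-9]{1,5}")) {           // decimal numeric code such as &#65;
            int code = Integer.parseInt(body.substring(1));
            if (code > 0xFFFF) return 0;             // does not fit into a 16-bit char
            n[0] = end + 1;
            return (char) code;
        }
        Character named = NAMED.get(body);
        if (named == null) return 0;
        n[0] = end + 1;
        return named;
    }
}
```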


Utf8Length

public static int Utf8Length(char c)
Tells from the first UTF-8 code character how long the code is. Returns -1 if the character is not a UTF-8 code start. Returns 1 if the character is an ASCII character (below 128).
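
The length of a UTF-8 code follows from the leading bits of its first byte. The sketch below is a hypothetical illustration of that rule, not the javatools code. It assumes, as the eat methods appear to, that each char of the String carries one byte value.

```java
// Hypothetical stand-alone sketch; not the javatools implementation.
final class Utf8LengthSketch {

    /** Returns the code length announced by the first byte, or -1 if c cannot start a code. */
    static int utf8Length(char c) {
        if (c < 0x80) return 1;               // 0xxxxxxx: plain ASCII
        if (c > 0xFF) return -1;              // not a byte value at all
        if ((c & 0xE0) == 0xC0) return 2;     // 110xxxxx
        if ((c & 0xF0) == 0xE0) return 3;     // 1110xxxx
        if ((c & 0xF8) == 0xF0) return 4;     // 11110xxx
        return -1;                            // continuation byte (10xxxxxx) or invalid
    }
}
```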


eatUtf8

public static char eatUtf8(java.lang.String a,
                           int[] n)
Eats a UTF8 code from a String. There is also a built-in way in Java that converts UTF8 to characters and back, but it does not work with all characters.
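
A minimal sketch of such a decoder is given below, under stated assumptions: each char of the input holds one byte value, an ASCII character counts as a one-byte code, and only codes of up to three bytes are handled because longer ones do not fit into a 16-bit char. The class is hypothetical and does not reject overlong encodings.

```java
// Hypothetical stand-alone sketch; not the javatools implementation.
final class EatUtf8Sketch {

    /** Stores the consumed length in n[0]; 0 = no code here, -1 = corrupted code. */
    static char eatUtf8(String s, int[] n) {
        n[0] = 0;
        if (s.isEmpty()) return 0;
        int first = s.charAt(0);
        if (first < 0x80) {                          // ASCII: one-byte code
            n[0] = 1;
            return (char) first;
        }
        int len;
        int code;
        if (first <= 0xFF && (first & 0xE0) == 0xC0) {
            len = 2;
            code = first & 0x1F;                     // 5 payload bits
        } else if (first <= 0xFF && (first & 0xF0) == 0xE0) {
            len = 3;
            code = first & 0x0F;                     // 4 payload bits
        } else {
            return 0;                                // not the start of a 1-3 byte code
        }
        if (s.length() < len) {
            n[0] = -1;
            return 0;
        }
        for (int i = 1; i < len; i++) {
            int next = s.charAt(i);
            if (next > 0xFF || (next & 0xC0) != 0x80) {  // must be 10xxxxxx
                n[0] = -1;
                return 0;
            }
            code = (code << 6) | (next & 0x3F);      // append 6 payload bits
        }
        n[0] = len;
        return (char) code;
    }
}
```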


decodeUTF8

public static java.lang.String decodeUTF8(java.lang.String s)
Decodes all UTF8 characters in the string


decodePercentage

public static java.lang.String decodePercentage(java.lang.String s)
Decodes all percentage characters in the string


decodeAmpersand_UNKNOWN

public static java.lang.String decodeAmpersand_UNKNOWN(java.lang.String s)
Note from Fabian: this method cannot decode numeric hexadecimal ampersand codes. Its purpose is unclear (TODO).


decodeAmpersand

public static java.lang.String decodeAmpersand(java.lang.String s,
                                               PositionTracker posTracker)

decodeAmpersand

public static java.lang.String decodeAmpersand(java.lang.String s)
Decodes all ampersand sequences in the string


decodeBackslash

public static java.lang.String decodeBackslash(java.lang.String s)
Decodes all backslash characters in the string


encodeBackslash

public static java.lang.String encodeBackslash(java.lang.CharSequence s,
                                               Char.Legal legal)
Encodes with backslash all illegal characters
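
The idea of encoding only the characters that a caller declares illegal might look as follows. This is a hypothetical sketch: the single-method Legal interface with isLegal(char) is an assumption about Char.Legal, and the four-digit hexadecimal backslash-u output format is likewise assumed.

```java
// Hypothetical stand-alone sketch; not the javatools implementation.
final class BackslashLegalSketch {

    /** Assumed shape of the predicate; the real Char.Legal may differ. */
    interface Legal {
        boolean isLegal(char c);
    }

    /** Copies legal characters and writes all others as backslash-u codes. */
    static String encodeBackslash(CharSequence s, Legal legal) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (legal.isLegal(c)) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04X", (int) c));
            }
        }
        return out.toString();
    }
}
```

Passing the predicate as a lambda, for example c -> c >= 'a' && c <= 'z', keeps lowercase letters and encodes everything else.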


eatBackslash

public static char eatBackslash(java.lang.String a,
                                int[] n)
Eats a backslash sequence from a String
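
For illustration, a backslash eater could recognise a few single-letter escapes and the four-digit hexadecimal form. The sketch below is hypothetical. The set of supported escapes is an assumption and may be smaller or larger than what the real method accepts.

```java
// Hypothetical stand-alone sketch; not the javatools implementation.
final class EatBackslashSketch {

    /** Stores the consumed length in n[0]; 0 = no code here, -1 = corrupted code. */
    static char eatBackslash(String s, int[] n) {
        n[0] = 0;
        if (s.isEmpty() || s.charAt(0) != '\\') return 0;
        n[0] = -1;                                   // assume corrupted until proven valid
        if (s.length() < 2) return 0;
        switch (s.charAt(1)) {
            case 'n':  n[0] = 2; return '\n';
            case 't':  n[0] = 2; return '\t';
            case '\\': n[0] = 2; return '\\';
            case '"':  n[0] = 2; return '"';
            case 'u': {
                if (s.length() < 6) return 0;        // needs four hex digits
                int code = 0;
                for (int i = 2; i < 6; i++) {
                    int d = Character.digit(s.charAt(i), 16);
                    if (d < 0) return 0;
                    code = code * 16 + d;
                }
                n[0] = 6;
                return (char) code;
            }
            default:
                return 0;                            // unknown escape letter
        }
    }
}
```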


decode

public static java.lang.String decode(java.lang.String s)
Replaces all codes in a String by the 16 bit Unicode characters


encodeUTF8

public static java.lang.String encodeUTF8(int c)
Encodes a character to UTF8 (if necessary)


encodeBackslash

public static java.lang.String encodeBackslash(char c)
Encodes a character to a backslash code (if necessary)


encodeBackslashToAlphanumeric

public static java.lang.String encodeBackslashToAlphanumeric(char c)
Encodes a character to a backslash code (if not alphanumeric)


encodeBackslashToASCII

public static java.lang.String encodeBackslashToASCII(char c)
Encodes a character to a backslash code (if not ASCII)


encodeAmpersand

public static java.lang.String encodeAmpersand(char c)
Encodes a character to an HTML-Ampersand code (if necessary)


encodeAmpersandToAlphanumeric

public static java.lang.String encodeAmpersandToAlphanumeric(char c)
Encodes a character to an HTML-Ampersand code (if necessary)


encodePercentage

public static java.lang.String encodePercentage(char c)
Encodes a character to a percentage code (if necessary). If the character is greater than 0x80, the character is converted to a UTF8-sequence and this sequence is encoded as percentage codes.
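
The UTF8 rule stated above can be sketched with the standard library: convert to UTF-8 bytes, then write each byte that must be escaped as "%XX". The class below is a hypothetical illustration. Which ASCII characters are left unencoded (here: letters and digits only) is an assumption, not the documented behaviour of this method.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone sketch; not the javatools implementation.
final class PercentEncodeSketch {

    /** Keeps ASCII letters and digits, percent-encodes every other UTF-8 byte. */
    static String encodePercentage(String s) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;                        // treat the byte as unsigned
            boolean keep = (v >= 'a' && v <= 'z')
                        || (v >= 'A' && v <= 'Z')
                        || (v >= '0' && v <= '9');
            if (keep) {
                out.append((char) v);
            } else {
                out.append(String.format("%%%02X", v));
            }
        }
        return out.toString();
    }
}
```

Under this rule a character such as U+00C4 becomes two percentage codes, one per UTF-8 byte.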


encodeXmlAttribute

public static java.lang.String encodeXmlAttribute(java.lang.String str)
Encodes a String with reserved XML characters into a valid xml string for attributes.
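
A minimal sketch of attribute escaping, assuming the five predefined XML entities are what is meant by reserved characters. The class is hypothetical and the real method may treat further characters.

```java
// Hypothetical stand-alone sketch; not the javatools implementation.
final class XmlAttributeSketch {

    /** Replaces the five XML special characters by their predefined entities. */
    static String encodeXmlAttribute(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```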


in

public static boolean in(char c,
                         char a,
                         char b)
Tells whether a char is in a range


in

public static boolean in(char c,
                         java.lang.String s)
Tells whether a char is in a string


isAlphanumeric

public static boolean isAlphanumeric(char c)
Tells whether a char is alphanumeric in the sense of URIs


isReserved

public static boolean isReserved(char c)
Tells whether a char is reserved in the sense of URIs


isUnreserved

public static boolean isUnreserved(char c)
Tells whether a char is unreserved in the sense of URIs (not the same as !reserved)


isEscaped

public static boolean isEscaped(java.lang.String s)
Tells whether a string is escaped in the sense of URIs


isSubDelim

public static boolean isSubDelim(char c)
Tells whether a char is a sub-delimiter in the sense of URIs


isGenDelim

public static boolean isGenDelim(char c)
Tells whether a char is a general delimiter in the sense of URIs


isPchar

public static boolean isPchar(char c)
Tells whether a char is a valid path component in the sense of URIs


encodeURIPathComponent

public static java.lang.String encodeURIPathComponent(char c)
Encodes a char to percentage code, if it is not a path character in the sense of URIs


encodeURIPathComponent

public static java.lang.String encodeURIPathComponent(java.lang.String s)
Encodes all chars of a String to percentage codes, if they are not path characters in the sense of URIs


encodeURIPathComponentXML

public static java.lang.String encodeURIPathComponentXML(java.lang.String s)
Encodes a char to percentage code, if it is not a path character in the sense of XMLs


decodeURIPathComponent

public static java.lang.String decodeURIPathComponent(java.lang.String s)
Decodes a URI path component


encodeUTF8

public static java.lang.String encodeUTF8(java.lang.String c)
Encodes a String to UTF8


encodeBackslash

public static java.lang.String encodeBackslash(java.lang.String c)
Replaces non-normal characters in a String by Backslash codes


encodeBackslashToAlphanumeric

public static java.lang.String encodeBackslashToAlphanumeric(java.lang.String c)
Replaces non-normal characters in a String by Backslash codes (if not alphanumeric)


encodeBackslashToASCII

public static java.lang.String encodeBackslashToASCII(java.lang.String c)
Replaces non-normal characters in a String by Backslash codes (if not ASCII)


encodeAmpersand

public static java.lang.String encodeAmpersand(java.lang.String c)
Replaces non-normal characters in a String by HTML Ampersand codes


encodeAmpersandToAlphanumeric

public static java.lang.String encodeAmpersandToAlphanumeric(java.lang.String c)
Replaces non-normal characters in a String by HTML Ampersand codes


encodePercentage

public static java.lang.String encodePercentage(java.lang.String c)
Replaces non-normal characters in a String by Percentage codes. If a character is greater than 0x80, the character is converted to a UTF8-sequence and this sequence is encoded as percentage codes.


decodeAndNormalize

public static java.lang.String decodeAndNormalize(java.lang.String s)
Decodes all codes in a String and normalizes all chars


normalize

public static java.lang.String normalize(java.lang.String s)
Normalizes all chars in a String to characters 0x20-0x7F


last

public static char last(java.lang.CharSequence s)
Returns the last character of a String or 0


cutLast

public static java.lang.String cutLast(java.lang.String s)
Returns the String without the last character


cutLast

public static java.lang.StringBuilder cutLast(java.lang.StringBuilder s)
Cuts the last character


toHTML

public static java.lang.String toHTML(java.lang.String s)
Returns an HTML-String of the String


hexAll

public static java.lang.String hexAll(java.lang.String s)
Returns the chars of a String in hex


encodeHex

public static java.lang.String encodeHex(java.lang.String s)
Replaces special characters in the string by hex codes (cannot be undone)


upCaseFirst

public static java.lang.String upCaseFirst(java.lang.String s)
Upcases the first character in a String


lowCaseFirst

public static java.lang.String lowCaseFirst(java.lang.String s)
Lowcases the first character in a String


truncate

public static java.lang.CharSequence truncate(java.lang.CharSequence s,
                                              int len)
Returns a string of the given length, fills with spaces if necessary
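
The described behaviour (cut when too long, pad with spaces when too short) could be written as below. The class is a hypothetical sketch and assumes a non-negative len.

```java
// Hypothetical stand-alone sketch; not the javatools implementation.
final class TruncateSketch {

    /** Returns exactly len characters: a prefix of s, or s padded with spaces on the right. */
    static CharSequence truncate(CharSequence s, int len) {
        if (s.length() >= len) {
            return s.subSequence(0, len);
        }
        StringBuilder out = new StringBuilder(len).append(s);
        while (out.length() < len) {
            out.append(' ');
        }
        return out;
    }
}
```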


capitalize

public static java.lang.String capitalize(java.lang.String s)
Capitalizes words and lowercases the rest
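
One possible reading of this description is sketched below: the first letter of each word is upcased and all other letters are lowcased. The class is hypothetical, and treating every non-letter as a word boundary is an assumption.

```java
// Hypothetical stand-alone sketch; not the javatools implementation.
final class CapitalizeSketch {

    /** Upcases the first letter of each run of letters and lowcases the rest. */
    static String capitalize(String s) {
        StringBuilder out = new StringBuilder(s.length());
        boolean startOfWord = true;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isLetter(c)) {
                out.append(startOfWord ? Character.toUpperCase(c) : Character.toLowerCase(c));
                startOfWord = false;
            } else {
                out.append(c);                       // non-letters are copied and end the word
                startOfWord = true;
            }
        }
        return out.toString();
    }
}
```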


endsWith

public static boolean endsWith(java.lang.CharSequence s,
                               java.lang.String end)
Returns true if the CharSequence ends with the string


main

public static void main(java.lang.String[] argv)
                 throws java.lang.Exception
Test routine

Throws:
java.lang.Exception