javatools.parsers
Class NameML

java.lang.Object
  extended by javatools.parsers.NameML
Direct Known Subclasses:
NameML.AbbreviationML, NameML.CompanyNameML, NameML.PersonNameML

public class NameML
extends java.lang.Object

This class is part of the Java Tools (see http://mpii.de/yago-naga/javatools). It is licensed under the Creative Commons Attribution License (see http://creativecommons.org/licenses/by/3.0) by the YAGO-NAGA team (see http://mpii.de/yago-naga).

This class is a multi-language extension of the Name class. Its functionality is equivalent to that of the Name class, but, given the corresponding word lists, it supports several languages.

The class NameML represents a name. There are three sub-types (subclasses) of names: abbreviations, person names and company names. These subclasses provide methods to access the components of the name (like the family name). Use the factory method NameML.of to create a NameML object of the appropriate subclass. Before you can call any NameML method, however, you need to initialize the class by calling one of the init methods, optionally with the path to its language-dependent configuration files.
Example:

 NameML.init();
 NameML.isName("Mouse");
   --> true
 NameML.isAbbreviation("PMM");
   --> true
 NameML.isPersonName("Mickey Mouse", Language.ENGLISH);
   --> false
 NameML.couldBePersonName("Mickey Mouse", Language.ENGLISH);
   --> true
 NameML.isPersonName("Pope Mickey Mouse", Language.ENGLISH);
   --> true
 NameML.isPersonName("Pope Mickey Mouse", Language.SPANISH);
   --> false
 NameML.isPersonName("Pope Mickey Mouse", Language.GERMAN);
   --> false
 NameML.isPersonName("Papst Mickey Mouse", Language.GERMAN);
   --> true
 NameML.of("Prof. Dr. Fabian the Great III of Saarbruecken", Language.ENGLISH).describe();
   // yields a NameML.PersonNameML in this case
   -->
   PersonName
     Original: Prof. Dr. Fabian the Great III of Saarbruecken
     Titles: Prof. Dr.
     Given Name: Fabian
     Given Names: Fabian
     Family Name Prefix: null
     Attribute Prefix: the
     Family Name: null
     Attribute: Great
     Family Name Suffix: null
     Roman: III
     City: Saarbruecken
     Normalized: Fabian_Great
 
IMPORTANT: Some recognition methods fall back to English as the target language, because not all methods have been adapted for multi-language support yet. Be aware that the interface might change for methods that are not yet language-dependent. Also note that you currently need to initialize the class by calling one of the init methods before you can use most of its functions; otherwise it may throw a NullPointerException.
TODO: Turn this class into an instantiable object? Then it would not be necessary to call init, nor to load all languages when only one is used.


Nested Class Summary
static class NameML.AbbreviationML
           
static class NameML.CompanyNameML
           
static class NameML.PersonNameML
           
 
Field Summary
static java.lang.String A
          Contains characters
static java.lang.String ANYNAME
          Holds the general default name
static java.lang.String attributePrefix
          Contains attribute Prefixes (like "the" in "Alexander the Great")
static java.util.regex.Pattern attributePrefixPattern
           
static java.lang.String B
          Contains blank
static java.lang.String BC
          Contains blank with optional comma
static java.lang.String BD
          Contains a word boundary
static java.lang.String companyNameSuffix
          Contains common company name suffixes (like "Inc")
static java.util.regex.Pattern companyNameSuffixPattern
           
static java.lang.String DG
          Contains digits
static java.lang.String directFamilyNamePrefix
          A direct family name prefix (such as "Mc")
static java.lang.String familyName
          Name component with an optional familyNamePrefix and postfix
static java.lang.String familyNamePrefix
          Contains common family name prefixes (like "von")
static java.util.regex.Pattern familyNamePrefixPattern
           
static java.lang.String familyNameSuffix
          Contains common name suffixes (like "Junior")
static java.util.regex.Pattern familyNameSuffixPattern
           
static java.lang.String givenName
          The pattern "Name[-Name]"
static java.lang.String givenNameComponent
          The pattern "Name."
static java.lang.String givenNames
          The pattern (personNameComponent+B)+
static java.lang.String H
          Contains hyphens
static java.lang.String L
          Contains lower case Characters
static java.util.Map<java.lang.String,java.lang.String> languageCodes
          Language codes
static java.util.regex.Pattern laxAbbreviationPattern
          Contains the lax pattern for abbreviations
static java.util.regex.Pattern laxCompanyPattern
          Contains the pattern for companies
static java.lang.String laxName
          Contains the pattern for names.
static java.util.regex.Pattern laxNamePattern
          Contains the pattern for names.
static java.util.regex.Pattern laxPersonNamePatternDe
           
static java.util.regex.Pattern laxPersonNamePatternEn
           
static java.util.regex.Pattern laxPersonNamePatternEs
           
static java.util.regex.Pattern laxPersonNamePatternFr
           
static java.util.regex.Pattern laxPersonNamePatternIt
           
static java.util.Map<java.lang.String,java.lang.String> nationality2country
           
static java.lang.String nickName
          Nickname '...'
static java.lang.String of
          Contains the English "of"
static java.lang.String or
          Contains "|"
static java.lang.String personNameComponent
          The pattern "Name"
static java.lang.String prep
          Contains prepositions
static java.lang.String roman
          Contains Roman digits
static java.util.regex.Pattern safeAbbreviationPattern
          Contains the safe pattern for abbreviations
static java.util.regex.Pattern safeCompanyPattern
          Contains the safe pattern for companies
static java.lang.String safeName
          Contains a pattern that indicates strings that are very likely to be names
static java.util.regex.Pattern safeNamePattern
          Contains a pattern that indicates strings that are very likely to be names
static java.util.regex.Pattern safeNamesPattern
          Contains a pattern that indicates strings that are very likely to be names
static java.util.regex.Pattern safeNamesPatternNoPrep
          Contains a pattern that indicates strings that are very likely to be names
static java.util.regex.Pattern safePersonNamePatternDe
           
static java.util.regex.Pattern safePersonNamePatternEn
           
static java.util.regex.Pattern safePersonNamePatternEs
           
static java.util.regex.Pattern safePersonNamePatternFr
           
static java.util.regex.Pattern safePersonNamePatternIt
           
static java.lang.String teamName
          team name
static java.util.regex.Pattern teamNamePattern
           
static java.util.regex.Pattern titlePatternDe
           
static java.util.regex.Pattern titlePatternEn
          Matches common titles (like "Mr.")
static java.util.regex.Pattern titlePatternEs
           
static java.util.regex.Pattern titlePatternFr
           
static java.util.regex.Pattern titlePatternIt
           
static java.lang.String U
          Contains upper case Characters
static java.util.Map<java.lang.String,java.lang.String> usStates
           
 
Method Summary
static java.lang.String c(java.lang.String s)
          Capturing group
static boolean couldBeAbbreviation(java.lang.String word)
          Tells whether a string could be an abbreviation.
static boolean couldBeCompanyName(java.lang.String s)
          Tells if the string could be a company name
static boolean couldBeName(java.lang.String s)
          Tells whether a String could possibly be a name
static boolean couldBePersonName(java.lang.String s, Language lang)
          Returns true if it is possible that the string is a person name
 java.lang.String describe()
          Returns a description
static java.io.InputStream getConfigFileStream(java.lang.String configfile)
           
static void init()
          Simply call this function to initialize NameML with the default values
static void init(NonsharedParameters params)
           
static void init(java.lang.String configPath)
          Initializes NameML with word lists (such as stopword lists) from a custom path
static boolean isAbbreviation(java.lang.String word)
          Tells whether a string is an abbreviation with high probability
static boolean isAttributePrefix(java.lang.String s)
          Says whether this String is an attribute Prefix (like "the" in "Alexander the Great")
static boolean isCompanyName(java.lang.String s)
          Tells if the string is a company name with high probability
static boolean isCompanyNameSuffix(java.lang.String s)
          Says whether this String is a company name suffix
static boolean isFamilyNamePrefix(java.lang.String s)
          Says whether this String is a family name prefix
static boolean isLanguage(java.lang.String s)
          Returns TRUE for languages
static boolean isLanguageCode(java.lang.String s)
          Returns TRUE for language codes
static boolean isName(java.lang.String s)
          Tells whether a String is a name with high probability
static boolean isNames(java.lang.String s)
          Tells whether a String is a sequence of names with high probability
static boolean isNation(java.lang.String s)
          Returns TRUE for nations
static boolean isNationality(java.lang.String s)
          Returns TRUE for nationalities
static boolean isPersonName(java.lang.String m, Language lang)
          Returns true if it is highly probable that the string is a person name.
static boolean isPersonNameSuffix(java.lang.String s)
          Says whether this String is a person name suffix
static boolean isStopWord(java.lang.String w, Language l)
          TRUE for stopwords
static boolean isTitle(java.lang.String s, Language lang)
          Says whether this String is a title
static boolean isUSState(java.lang.String s)
          Returns TRUE for US States
static boolean isUSStateAbbreviation(java.lang.String s)
          Returns TRUE for US State abbreviations
static java.lang.String languageForCode(java.lang.String s)
          Returns the language for a code (or NULL)
static void main(java.lang.String[] argv)
          Test routine
static java.lang.String mul(java.lang.String s)
          Repeats the token with blanks one or more times
static java.lang.String mulHyp(java.lang.String s)
          Repeats the token with hyphens one or more times
static java.lang.String nationForNationality(java.lang.String s)
          Returns the nation for a nationality (or NULL)
 java.lang.String normalize()
          Returns the letters and digits of the original name (eliminates punctuation)
static NameML of(java.lang.String s, Language lang)
          Factory pattern
static java.lang.String opt(java.lang.String s)
          optional component
static java.lang.String optMul(java.lang.String s)
          optional multiple component
static java.lang.String or(java.lang.String s1, java.lang.String s2)
          alternative
 java.lang.String original()
          Returns the original name
static java.util.List<java.lang.String> readTextFileLines(java.lang.String configFile, java.lang.String encoding)
          Read an entire text file as a list of strings.
static java.util.Set<java.lang.String> readTextFileLinesSet(java.lang.String configFile)
          Reads a set of strings from a UTF-8 text file, ignoring lines starting with "##"
 java.lang.String toString()
          Returns the original name
static java.lang.String unabbreviateUSState(java.lang.String s)
          Returns the US state for an abbreviation (or NULL)
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

ANYNAME

public static final java.lang.String ANYNAME
Holds the general default name

See Also:
Constant Field Values

roman

public static java.lang.String roman
Contains Roman digits


of

public static java.lang.String of
Contains the English "of"


U

public static final java.lang.String U
Contains upper case Characters

See Also:
Constant Field Values

L

public static final java.lang.String L
Contains lower case Characters

See Also:
Constant Field Values

A

public static final java.lang.String A
Contains characters

See Also:
Constant Field Values

B

public static final java.lang.String B
Contains blank

See Also:
Constant Field Values

BD

public static final java.lang.String BD
Contains a word boundary

See Also:
Constant Field Values

BC

public static final java.lang.String BC
Contains blank with optional comma

See Also:
Constant Field Values

DG

public static final java.lang.String DG
Contains digits

See Also:
Constant Field Values

H

public static final java.lang.String H
Contains hyphens

See Also:
Constant Field Values

or

public static final java.lang.String or
Contains "|"

See Also:
Constant Field Values

familyNamePrefix

public static final java.lang.String familyNamePrefix
Contains common family name prefixes (like "von")

See Also:
Constant Field Values

familyNamePrefixPattern

public static final java.util.regex.Pattern familyNamePrefixPattern

attributePrefix

public static java.lang.String attributePrefix
Contains attribute Prefixes (like "the" in "Alexander the Great")


attributePrefixPattern

public static java.util.regex.Pattern attributePrefixPattern

familyNameSuffix

public static final java.lang.String familyNameSuffix
Contains common name suffixes (like "Junior")

See Also:
Constant Field Values

familyNameSuffixPattern

public static final java.util.regex.Pattern familyNameSuffixPattern

titlePatternEn

public static java.util.regex.Pattern titlePatternEn
Matches common titles (like "Mr.")


titlePatternDe

public static java.util.regex.Pattern titlePatternDe

titlePatternFr

public static java.util.regex.Pattern titlePatternFr

titlePatternEs

public static java.util.regex.Pattern titlePatternEs

titlePatternIt

public static java.util.regex.Pattern titlePatternIt

companyNameSuffix

public static final java.lang.String companyNameSuffix
Contains common company name suffixes (like "Inc")

See Also:
Constant Field Values

companyNameSuffixPattern

public static final java.util.regex.Pattern companyNameSuffixPattern

teamName

public static final java.lang.String teamName
team name

See Also:
Constant Field Values

teamNamePattern

public static final java.util.regex.Pattern teamNamePattern

prep

public static final java.lang.String prep
Contains prepositions

See Also:
Constant Field Values

laxName

public static final java.lang.String laxName
Contains the pattern for names. Practically everything is a name if it starts with an uppercase letter

See Also:
Constant Field Values

laxNamePattern

public static final java.util.regex.Pattern laxNamePattern
Contains the pattern for names. Practically everything is a name if it starts with an uppercase letter


safeName

public static final java.lang.String safeName
Contains a pattern that indicates strings that are very likely to be names

See Also:
Constant Field Values

safeNamePattern

public static final java.util.regex.Pattern safeNamePattern
Contains a pattern that indicates strings that are very likely to be names


safeNamesPattern

public static final java.util.regex.Pattern safeNamesPattern
Contains a pattern that indicates strings that are very likely to be names


safeNamesPatternNoPrep

public static final java.util.regex.Pattern safeNamesPatternNoPrep
Contains a pattern that indicates strings that are very likely to be names


laxAbbreviationPattern

public static final java.util.regex.Pattern laxAbbreviationPattern
Contains the lax pattern for abbreviations


safeAbbreviationPattern

public static final java.util.regex.Pattern safeAbbreviationPattern
Contains the safe pattern for abbreviations


laxCompanyPattern

public static final java.util.regex.Pattern laxCompanyPattern
Contains the pattern for companies


safeCompanyPattern

public static final java.util.regex.Pattern safeCompanyPattern
Contains the safe pattern for companies


directFamilyNamePrefix

public static final java.lang.String directFamilyNamePrefix
A direct family name prefix (such as "Mc")

See Also:
Constant Field Values

personNameComponent

public static final java.lang.String personNameComponent
The pattern "Name"

See Also:
Constant Field Values

givenNameComponent

public static final java.lang.String givenNameComponent
The pattern "Name."


givenName

public static final java.lang.String givenName
The pattern "Name[-Name]"


givenNames

public static final java.lang.String givenNames
The pattern (personNameComponent+B)+


familyName

public static final java.lang.String familyName
Name component with an optional familyNamePrefix and postfix


nickName

public static final java.lang.String nickName
Nickname '...'

See Also:
Constant Field Values

laxPersonNamePatternEn

public static java.util.regex.Pattern laxPersonNamePatternEn

laxPersonNamePatternDe

public static java.util.regex.Pattern laxPersonNamePatternDe

laxPersonNamePatternEs

public static java.util.regex.Pattern laxPersonNamePatternEs

laxPersonNamePatternFr

public static java.util.regex.Pattern laxPersonNamePatternFr

laxPersonNamePatternIt

public static java.util.regex.Pattern laxPersonNamePatternIt

safePersonNamePatternEn

public static java.util.regex.Pattern safePersonNamePatternEn

safePersonNamePatternDe

public static java.util.regex.Pattern safePersonNamePatternDe

safePersonNamePatternEs

public static java.util.regex.Pattern safePersonNamePatternEs

safePersonNamePatternFr

public static java.util.regex.Pattern safePersonNamePatternFr

safePersonNamePatternIt

public static java.util.regex.Pattern safePersonNamePatternIt

usStates

public static java.util.Map<java.lang.String,java.lang.String> usStates

languageCodes

public static java.util.Map<java.lang.String,java.lang.String> languageCodes
Language codes


nationality2country

public static java.util.Map<java.lang.String,java.lang.String> nationality2country

Method Detail

init

public static final void init(NonsharedParameters params)

init

public static final void init(java.lang.String configPath)
To use your own word lists (stopword lists etc.; see javatools.resources.parsing), pass a path where NameML will look for these files instead of the default resource location (javatools.resources.parsing). To make sure that your own word list set covers all necessary files, it is best to start by copying the files from src/javatools/resources/parsing/.

Parameters:
configPath - The path that contains all word lists.

init

public static final void init()
Simply call this function to initialize NameML with the default values


mul

public static java.lang.String mul(java.lang.String s)
Repeats the token with blanks one or more times


mulHyp

public static java.lang.String mulHyp(java.lang.String s)
Repeats the token with hyphens one or more times


opt

public static java.lang.String opt(java.lang.String s)
optional component


optMul

public static java.lang.String optMul(java.lang.String s)
optional multiple component


or

public static java.lang.String or(java.lang.String s1,
                                  java.lang.String s2)
alternative


c

public static java.lang.String c(java.lang.String s)
Capturing group
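
The helpers mul, mulHyp, opt, optMul, or and c build regular expression strings from smaller pieces. The following is a minimal sketch of how such helpers could be written; it is not the library's code. The class name PatternHelperSketch, the helper g, and the choice of a single space as the blank and an ASCII '-' as the hyphen are assumptions made for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-ins for the NameML pattern helpers (not the library's code). */
final class PatternHelperSketch {
    // Assumption: a blank is one space character.
    static final String B = " ";
    // Assumption: a hyphen is the ASCII '-'.
    static final String H = "-";

    /** Wraps s in a non-capturing group so that a quantifier applies to all of it. */
    static String g(String s) { return "(?:" + s + ")"; }

    /** One or more occurrences of s, separated by blanks. */
    static String mul(String s) { return g(s) + g(B + g(s)) + "*"; }

    /** One or more occurrences of s, separated by hyphens. */
    static String mulHyp(String s) { return g(s) + g(H + g(s)) + "*"; }

    /** Zero or one occurrence of s. */
    static String opt(String s) { return g(s) + "?"; }

    /** Zero or more blank-separated occurrences of s (assumed meaning). */
    static String optMul(String s) { return opt(mul(s)); }

    /** Either s1 or s2. */
    static String or(String s1, String s2) { return g(s1 + "|" + s2); }

    /** Capturing group around s. */
    static String c(String s) { return "(" + s + ")"; }

    public static void main(String[] args) {
        String word = "[A-Z][a-z]+";
        // Optional title, then blank-separated (possibly hyphenated) words.
        String name = opt(c(or("Mr\\.", "Mrs\\.")) + B) + c(mul(mulHyp(word)));
        Matcher m = Pattern.compile(name).matcher("Mrs. Anne-Marie Mouse");
        if (m.matches()) {
            System.out.println("title = " + m.group(1));
            System.out.println("names = " + m.group(2));
        }
    }
}
```

Wrapping every operand in a non-capturing group keeps the composed expression unambiguous, so only c introduces numbered groups.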


isFamilyNamePrefix

public static boolean isFamilyNamePrefix(java.lang.String s)
Says whether this String is a family name prefix


isAttributePrefix

public static boolean isAttributePrefix(java.lang.String s)
Says whether this String is an attribute Prefix (like "the" in "Alexander the Great")


isPersonNameSuffix

public static boolean isPersonNameSuffix(java.lang.String s)
Says whether this String is a person name suffix


isTitle

public static boolean isTitle(java.lang.String s,
                              Language lang)
Says whether this String is a title


isCompanyNameSuffix

public static boolean isCompanyNameSuffix(java.lang.String s)
Says whether this String is a company name suffix


isName

public static boolean isName(java.lang.String s)
Tells whether a String is a name with high probability


isNames

public static boolean isNames(java.lang.String s)
Tells whether a String is a sequence of names with high probability


couldBeName

public static boolean couldBeName(java.lang.String s)
Tells whether a String could possibly be a name
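
The is... methods answer "with high probability", while the couldBe... methods answer "possibly". The sketch below illustrates this two-tier idea with a strict ("safe") and a permissive ("lax") pattern. Both patterns and the class name TwoTierNameSketch are assumptions for illustration; the real safeNamePattern and laxNamePattern are not shown in this documentation and may well differ.

```java
import java.util.regex.Pattern;

/** Illustrates strict versus permissive name checks with made-up patterns. */
final class TwoTierNameSketch {
    // Hypothetical lax pattern: anything that starts with an uppercase letter.
    static final Pattern LAX = Pattern.compile("\\p{Lu}.*");
    // Hypothetical safe pattern: an uppercase letter followed by
    // at least two lowercase letters and nothing else.
    static final Pattern SAFE = Pattern.compile("\\p{Lu}\\p{Ll}{2,}");

    /** True if the string passes the permissive check. */
    static boolean couldBeName(String s) { return LAX.matcher(s).matches(); }

    /** True if the string passes the strict check. */
    static boolean isName(String s) { return SAFE.matcher(s).matches(); }

    public static void main(String[] args) {
        for (String s : new String[] {"Mouse", "R2D2", "mouse"}) {
            System.out.println(s + ": isName=" + isName(s)
                    + ", couldBeName=" + couldBeName(s));
        }
    }
}
```

With these assumed patterns, every string accepted by the strict check is also accepted by the permissive one, which mirrors the relation between the "is" and "couldBe" methods described here.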


isStopWord

public static boolean isStopWord(java.lang.String w,
                                 Language l)
TRUE for stopwords


toString

public java.lang.String toString()
Returns the original name

Overrides:
toString in class java.lang.Object

normalize

public java.lang.String normalize()
Returns the letters and digits of the original name (eliminates punctuation)
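
A minimal sketch of this kind of normalization follows. It is not the library's implementation. Joining words with underscores is an assumption, suggested only by the "Normalized: Fabian_Great" line in the class example; the class name NormalizeSketch is hypothetical.

```java
/** Hypothetical normalization: keep letters and digits, drop punctuation. */
final class NormalizeSketch {
    static String normalize(String original) {
        return original.trim()
                // Assumption: runs of whitespace become a single underscore.
                .replaceAll("\\s+", "_")
                // Remove everything that is not a letter, a digit or an underscore.
                .replaceAll("[^\\p{L}\\p{N}_]", "");
    }

    public static void main(String[] args) {
        System.out.println(normalize("Mickey Mouse, Jr."));
        System.out.println(normalize("I.B.M."));
    }
}
```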


describe

public java.lang.String describe()
Returns a description


original

public java.lang.String original()
Returns the original name


readTextFileLines

public static java.util.List<java.lang.String> readTextFileLines(java.lang.String configFile,
                                                                 java.lang.String encoding)
                                                          throws java.io.IOException
Read an entire text file as a list of strings. The strings do not include a terminating new line character.

Parameters:
configFile - text file to open
encoding - character encoding name (as used by Java) or null for UTF-8
Returns:
contents of file
Throws:
java.io.IOException
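
The contract described above (lines without terminators, null meaning UTF-8) can be sketched with the standard java.nio.file API. This is an illustration under assumptions, not the library's code: the class name ReadLinesSketch is hypothetical, and the sketch reads from the file system, whereas NameML may resolve configuration files differently (see getConfigFileStream).

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Hypothetical file-system variant of a "read all lines" helper. */
final class ReadLinesSketch {
    static List<String> readLines(String file, String encoding) throws IOException {
        // A null encoding falls back to UTF-8, as the documentation states.
        Charset charset = (encoding == null)
                ? StandardCharsets.UTF_8
                : Charset.forName(encoding);
        // Files.readAllLines strips the line terminators.
        return Files.readAllLines(Paths.get(file), charset);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("names", ".txt");
        try {
            Files.write(tmp, Arrays.asList("Mickey", "Mouse"), StandardCharsets.UTF_8);
            System.out.println(readLines(tmp.toString(), null));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```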

getConfigFileStream

public static java.io.InputStream getConfigFileStream(java.lang.String configfile)
                                               throws java.io.FileNotFoundException
Throws:
java.io.FileNotFoundException

readTextFileLinesSet

public static java.util.Set<java.lang.String> readTextFileLinesSet(java.lang.String configFile)
Reads a set of strings from a UTF-8 text file, ignoring lines starting with "##"

Parameters:
configFile - the text file to read
Returns:
set of strings
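
As a rough illustration of the "##" comment convention, the following sketch loads a word list into a set and skips comment lines. The class name ReadSetSketch is hypothetical and the file is read from the file system, which is an assumption; how blank lines or surrounding whitespace are treated by the real method is not documented here, so the sketch leaves them untouched.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical word-list loader that skips lines starting with "##". */
final class ReadSetSketch {
    static Set<String> readSet(String file) throws IOException {
        Set<String> result = new HashSet<>();
        for (String line : Files.readAllLines(Paths.get(file), StandardCharsets.UTF_8)) {
            // Lines starting with "##" are treated as comments and ignored.
            if (!line.startsWith("##")) {
                result.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("stopwords", ".txt");
        try {
            Files.write(tmp, Arrays.asList("## English stopwords", "the", "of", "the"),
                    StandardCharsets.UTF_8);
            // Duplicates collapse because the result is a set.
            System.out.println(readSet(tmp.toString()));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```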

isAbbreviation

public static boolean isAbbreviation(java.lang.String word)
Tells whether a string is an abbreviation with high probability


couldBeAbbreviation

public static boolean couldBeAbbreviation(java.lang.String word)
Tells whether a string could be an abbreviation.


isCompanyName

public static boolean isCompanyName(java.lang.String s)
Tells if the string is a company name with high probability


couldBeCompanyName

public static boolean couldBeCompanyName(java.lang.String s)
Tells if the string could be a company name


couldBePersonName

public static boolean couldBePersonName(java.lang.String s,
                                        Language lang)
Returns true if it is possible that the string is a person name


isPersonName

public static boolean isPersonName(java.lang.String m,
                                   Language lang)
Returns true if it is highly probable that the string is a person name.


isUSState

public static boolean isUSState(java.lang.String s)
Returns TRUE for US States


isUSStateAbbreviation

public static boolean isUSStateAbbreviation(java.lang.String s)
Returns TRUE for US State abbreviations


unabbreviateUSState

public static java.lang.String unabbreviateUSState(java.lang.String s)
Returns the US state for an abbreviation (or NULL)
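
The three US state methods can be pictured as lookups in a single map, with a missing key yielding null. The sketch below assumes that the map goes from abbreviation to full state name; the direction of the real usStates map is not documented here, and the class name and the two sample entries are purely illustrative.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical map-backed lookups; a missing key yields null. */
final class UsStateLookupSketch {
    // Assumption: abbreviation -> full state name.
    static final Map<String, String> STATES = new HashMap<>();
    static {
        STATES.put("CA", "California");
        STATES.put("NY", "New York");
    }

    static boolean isUSStateAbbreviation(String s) { return STATES.containsKey(s); }

    static boolean isUSState(String s) { return STATES.containsValue(s); }

    /** Returns the full name, or null if the abbreviation is unknown. */
    static String unabbreviateUSState(String s) { return STATES.get(s); }

    public static void main(String[] args) {
        System.out.println(unabbreviateUSState("CA"));
        System.out.println(unabbreviateUSState("XX"));
    }
}
```

Because unknown inputs produce null rather than an exception, callers of such a lookup need to check the result before using it.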


isLanguage

public static boolean isLanguage(java.lang.String s)
Returns TRUE for languages


isLanguageCode

public static boolean isLanguageCode(java.lang.String s)
Returns TRUE for language codes


languageForCode

public static java.lang.String languageForCode(java.lang.String s)
Returns the language for a code (or NULL)


isNation

public static boolean isNation(java.lang.String s)
Returns TRUE for nations


isNationality

public static boolean isNationality(java.lang.String s)
Returns TRUE for nationalities


nationForNationality

public static java.lang.String nationForNationality(java.lang.String s)
Returns the nation for a nationality (or NULL)


of

public static NameML of(java.lang.String s,
                        Language lang)
Factory pattern
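
The factory idea (one static method that picks the appropriate subclass) can be sketched in a self-contained way as follows. Everything in this block is hypothetical: the class NameSketch, its nested classes, the three patterns and the order of the checks are invented for illustration, and the Language parameter of the real method is omitted. It shows the shape of the pattern, not how NameML.of classifies names.

```java
import java.util.regex.Pattern;

/** Hypothetical factory that returns the most specific name subclass. */
class NameSketch {
    // Made-up classification patterns, for illustration only.
    static final Pattern ABBREVIATION = Pattern.compile("[A-Z]{2,}");
    static final Pattern COMPANY = Pattern.compile(".+ (?:Inc|Ltd|GmbH)\\.?");
    static final Pattern PERSON = Pattern.compile("[A-Z][a-z]+(?: [A-Z][a-z]+)+");

    final String original;

    NameSketch(String original) { this.original = original; }

    String describe() { return getClass().getSimpleName() + ": " + original; }

    static final class Abbreviation extends NameSketch {
        Abbreviation(String s) { super(s); }
    }

    static final class CompanyName extends NameSketch {
        CompanyName(String s) { super(s); }
    }

    static final class PersonName extends NameSketch {
        PersonName(String s) { super(s); }
    }

    /** Tries the specific kinds first and falls back to a plain name. */
    static NameSketch of(String s) {
        if (ABBREVIATION.matcher(s).matches()) return new Abbreviation(s);
        if (COMPANY.matcher(s).matches()) return new CompanyName(s);
        if (PERSON.matcher(s).matches()) return new PersonName(s);
        return new NameSketch(s);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"PMM", "Acme Inc.", "Mickey Mouse", "mouse"}) {
            System.out.println(of(s).describe());
        }
    }
}
```

Callers only depend on the base type and can test for a subclass when they need component-level access, which is the benefit the factory method provides.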


main

public static void main(java.lang.String[] argv)
                 throws java.lang.Exception
Test routine

Throws:
java.lang.Exception