java.lang.Object javatools.parsers.NameML
public class NameML
This class is part of the Java Tools (see
http://mpii.de/yago-naga/javatools). It is licensed under the Creative
Commons Attribution License (see http://creativecommons.org/licenses/by/3.0)
by the YAGO-NAGA team (see http://mpii.de/yago-naga).
This class is a multi-language extension of the Name class. It offers the same
functionality as the Name class but, given corresponding word lists, supports
various languages.
The class NameML represents a name. There are three sub-types (subclasses) of
names: Abbreviations, person names and company names. These subclasses
provide methods to access the components of the name (like the family name).
Use the factory method NameML.of to create a NameML object of the appropriate
subclass. Before you can apply any NameML method, however, you need to
initialize the class with the path to its language-dependent configuration
files.
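The initialization requirement can be illustrated with a small stand-in class. This is a sketch only: InitGuardSketch, its hard-coded title list and its IllegalStateException are assumptions made for illustration and are not part of javatools, where the word lists come from configuration files. It shows one way a statically initialized class can report a missing init call clearly instead of failing on a null reference.

```java
import java.util.Set;

/** Hypothetical stand-in for a statically initialized recognizer; not part of javatools. */
final class InitGuardSketch {
    // Word list that stays null until init() has been called.
    private static Set<String> titles;

    /** Loads a hard-coded list here; a real implementation would read configuration files. */
    static void init() {
        titles = Set.of("Mr.", "Mrs.", "Dr.", "Prof.");
    }

    /** Fails fast with a clear message if init() was never called. */
    static boolean isTitle(String s) {
        if (titles == null) {
            throw new IllegalStateException("call init() before using this class");
        }
        return titles.contains(s);
    }

    public static void main(String[] args) {
        init();
        System.out.println(isTitle("Dr."));
    }
}
```

Calling isTitle before init in this sketch raises the IllegalStateException rather than a NullPointerException.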
Example:
    NameML.init();
    NameML.isName("Mouse");
      --> true
    NameML.isAbbreviation("PMM");
      --> true
    NameML.isPersonName("Mickey Mouse",Language.ENGLISH);
      --> false
    NameML.couldBePersonName("Mickey Mouse",Language.ENGLISH);
      --> true
    NameML.isPersonName("Pope Mickey Mouse",Language.ENGLISH);
      --> true
    NameML.isPersonName("Pope Mickey Mouse",Language.SPANISH);
      --> false
    NameML.isPersonName("Pope Mickey Mouse",Language.GERMAN);
      --> false
    NameML.isPersonName("Papst Mickey Mouse",Language.GERMAN);
      --> true
    NameML.of("Prof. Dr. Fabian the Great III of Saarbruecken").describe()
    // equivalent to new PersonName(...) in this case
      -->
      PersonName
        Original: Prof. Dr. Fabian the Great III of Saarbruecken
        Titles: Prof. Dr.
        Given Name: Fabian
        Given Names: Fabian
        Family Name Prefix: null
        Attribute Prefix: the
        Family Name: null
        Attribute: Great
        Family Name Suffix: null
        Roman: III
        City: Saarbruecken
        Normalized: Fabian_Great
IMPORTANT: Note that for some recognition methods the class falls back to
English as the target language, since not all methods have been adapted yet
for multi-language support. Be aware that the interface might change for
methods that are not yet language-dependent.
Also note that currently you need to initialize the class first by calling one
of the init functions before you can use most of its functions. Otherwise it
may throw null pointer errors.
TODO: Turn completely into an instantiable object? Then we would not need to
init nor to load all languages while only using 1.
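The example contrasts isPersonName (high probability) with couldBePersonName (mere possibility). The sketch below shows one way such a two-tier check can be built from a strict and a permissive regular expression. The class LaxSafeSketch, its fragments and its three-entry title list are assumptions for illustration; the actual NameML patterns are not shown on this page and are likely more elaborate. Requiring a title for the strict tier is only a guess suggested by the example output.

```java
import java.util.regex.Pattern;

/** Toy two-tier recognizer; the patterns are assumptions, not NameML's real ones. */
final class LaxSafeSketch {
    // Assumed fragment: one capitalized word such as "Mickey".
    private static final String WORD = "\\p{Lu}\\p{Ll}+";
    // Assumed fragment: two or more capitalized words separated by single blanks.
    private static final String WORDS = WORD + "(?: " + WORD + ")+";
    // Hypothetical, tiny English title list.
    private static final String TITLE = "(?:Pope|Dr\\.|Prof\\.)";

    // Lax: titles are optional, so any run of capitalized words qualifies.
    private static final Pattern LAX = Pattern.compile("(?:" + TITLE + " )*" + WORDS);
    // Safe: at least one known title must precede the words.
    private static final Pattern SAFE = Pattern.compile("(?:" + TITLE + " )+" + WORDS);

    /** Permissive check: true if the string has the general shape of a person name. */
    static boolean couldBePersonName(String s) {
        return LAX.matcher(s).matches();
    }

    /** Strict check: true only if a known title supports the reading as a person name. */
    static boolean isPersonName(String s) {
        return SAFE.matcher(s).matches();
    }
}
```

In this sketch every string accepted by the strict check is also accepted by the permissive one, which mirrors the relationship the example shows for "Mickey Mouse" and "Pope Mickey Mouse".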
Nested Class Summary | |
---|---|
static class |
NameML.AbbreviationML
|
static class |
NameML.CompanyNameML
|
static class |
NameML.PersonNameML
|
Field Summary | |
---|---|
static java.lang.String |
A
Contains characters |
static java.lang.String |
ANYNAME
Holds the general default name |
static java.lang.String |
attributePrefix
Contains attribute Prefixes (like "the" in "Alexander the Great") |
static java.util.regex.Pattern |
attributePrefixPattern
|
static java.lang.String |
B
Contains blank |
static java.lang.String |
BC
Contains blank with optional comma |
static java.lang.String |
BD
Contains a word boundary |
static java.lang.String |
companyNameSuffix
Contains common company name suffixes (like "Inc") |
static java.util.regex.Pattern |
companyNameSuffixPattern
|
static java.lang.String |
DG
Contains digits |
static java.lang.String |
directFamilyNamePrefix
A direct family name prefix (such as "Mc") |
static java.lang.String |
familyName
Name component with an optional familyNamePrefix and postfix |
static java.lang.String |
familyNamePrefix
Contains common family name prefixes (like "von") |
static java.util.regex.Pattern |
familyNamePrefixPattern
|
static java.lang.String |
familyNameSuffix
Contains common name suffixes (like "Junior") |
static java.util.regex.Pattern |
familyNameSuffixPattern
|
static java.lang.String |
givenName
The pattern "Name[-Name]" |
static java.lang.String |
givenNameComponent
The pattern "Name." |
static java.lang.String |
givenNames
The pattern (personNameComponent+B)+ |
static java.lang.String |
H
Contains hyphens |
static java.lang.String |
L
Contains lower case Characters |
static java.util.Map<java.lang.String,java.lang.String> |
languageCodes
Language codes |
static java.util.regex.Pattern |
laxAbbreviationPattern
Contains the lax pattern for abbreviations |
static java.util.regex.Pattern |
laxCompanyPattern
Contains the pattern for companies |
static java.lang.String |
laxName
Contains the pattern for names. |
static java.util.regex.Pattern |
laxNamePattern
Contains the pattern for names. |
static java.util.regex.Pattern |
laxPersonNamePatternDe
|
static java.util.regex.Pattern |
laxPersonNamePatternEn
|
static java.util.regex.Pattern |
laxPersonNamePatternEs
|
static java.util.regex.Pattern |
laxPersonNamePatternFr
|
static java.util.regex.Pattern |
laxPersonNamePatternIt
|
static java.util.Map<java.lang.String,java.lang.String> |
nationality2country
|
static java.lang.String |
nickName
Nickname '...' |
static java.lang.String |
of
Contains the English "of" |
static java.lang.String |
or
Contains "|" |
static java.lang.String |
personNameComponent
The pattern "Name" |
static java.lang.String |
prep
Contains prepositions |
static java.lang.String |
roman
Contains Roman numerals |
static java.util.regex.Pattern |
safeAbbreviationPattern
Contains the safe pattern for abbreviations |
static java.util.regex.Pattern |
safeCompanyPattern
Contains the safe pattern for companies |
static java.lang.String |
safeName
Contains a pattern that indicates strings that are very likely to be names |
static java.util.regex.Pattern |
safeNamePattern
Contains a pattern that indicates strings that are very likely to be names |
static java.util.regex.Pattern |
safeNamesPattern
Contains a pattern that indicates strings that are very likely to be names |
static java.util.regex.Pattern |
safeNamesPatternNoPrep
Contains a pattern that indicates strings that are very likely to be names |
static java.util.regex.Pattern |
safePersonNamePatternDe
|
static java.util.regex.Pattern |
safePersonNamePatternEn
|
static java.util.regex.Pattern |
safePersonNamePatternEs
|
static java.util.regex.Pattern |
safePersonNamePatternFr
|
static java.util.regex.Pattern |
safePersonNamePatternIt
|
static java.lang.String |
teamName
team name |
static java.util.regex.Pattern |
teamNamePattern
|
static java.util.regex.Pattern |
titlePatternDe
|
static java.util.regex.Pattern |
titlePatternEn
Matches common titles (like "Mr.") |
static java.util.regex.Pattern |
titlePatternEs
|
static java.util.regex.Pattern |
titlePatternFr
|
static java.util.regex.Pattern |
titlePatternIt
|
static java.lang.String |
U
Contains upper case Characters |
static java.util.Map<java.lang.String,java.lang.String> |
usStates
|
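Several fields above come in one variant per language (for example titlePatternEn and titlePatternDe). As a rough illustration of how a language argument can select among such patterns, the sketch below keeps one compiled pattern per language in an EnumMap. The class TitleLookupSketch, the enum Lang (a stand-in for the javatools Language type) and the title lists are hypothetical; NameML loads its lists from configuration files.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical per-language title lookup; not the NameML implementation. */
final class TitleLookupSketch {
    /** Stand-in for the javatools Language type. */
    enum Lang { ENGLISH, GERMAN }

    // One compiled pattern per language, echoing the titlePatternEn / titlePatternDe naming.
    private static final Map<Lang, Pattern> TITLE_PATTERNS = new EnumMap<>(Lang.class);
    static {
        // Assumed toy word lists.
        TITLE_PATTERNS.put(Lang.ENGLISH, Pattern.compile("Mr\\.|Mrs\\.|Dr\\.|Pope"));
        TITLE_PATTERNS.put(Lang.GERMAN, Pattern.compile("Herr|Frau|Dr\\.|Papst"));
    }

    /** True if the whole string is a known title in the given language. */
    static boolean isTitle(String s, Lang lang) {
        Pattern p = TITLE_PATTERNS.get(lang);
        return p != null && p.matcher(s).matches();
    }
}
```

With these toy lists, "Pope" is a title only for Lang.ENGLISH and "Papst" only for Lang.GERMAN.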
Method Summary | |
---|---|
static java.lang.String |
c(java.lang.String s)
Capturing group |
static boolean |
couldBeAbbreviation(java.lang.String word)
Tells whether a string could be an abbreviation. |
static boolean |
couldBeCompanyName(java.lang.String s)
Tells if the string could be a company name |
static boolean |
couldBeName(java.lang.String s)
Tells whether a String could possibly be a name |
static boolean |
couldBePersonName(java.lang.String s,
Language lang)
Returns true if it is possible that the string is a person name |
java.lang.String |
describe()
Returns a description |
static java.io.InputStream |
getConfigFileStream(java.lang.String configfile)
|
static void |
init()
Simply call this function to initialize NameML with the default values |
static void |
init(NonsharedParameters params)
|
static void |
init(java.lang.String configPath)
Initializes NameML with your own word lists (stopword lists etc.) from the given path |
static boolean |
isAbbreviation(java.lang.String word)
Tells whether a string is an abbreviation with high probability |
static boolean |
isAttributePrefix(java.lang.String s)
Says whether this String is an attribute Prefix (like "the" in "Alexander the Great") |
static boolean |
isCompanyName(java.lang.String s)
Tells if the string is a company name with high probability |
static boolean |
isCompanyNameSuffix(java.lang.String s)
Says whether this String is a company name suffix |
static boolean |
isFamilyNamePrefix(java.lang.String s)
Says whether this String is a family name prefix |
static boolean |
isLanguage(java.lang.String s)
Returns TRUE for languages |
static boolean |
isLanguageCode(java.lang.String s)
Returns TRUE for language codes |
static boolean |
isName(java.lang.String s)
Tells whether a String is a name with high probability |
static boolean |
isNames(java.lang.String s)
Tells whether a String is a sequence of names with high probability |
static boolean |
isNation(java.lang.String s)
Returns TRUE for nations |
static boolean |
isNationality(java.lang.String s)
Returns TRUE for nationalities |
static boolean |
isPersonName(java.lang.String m,
Language lang)
Returns true if it is highly probable that the string is a person name. |
static boolean |
isPersonNameSuffix(java.lang.String s)
Says whether this String is a person name suffix |
static boolean |
isStopWord(java.lang.String w,
Language l)
TRUE for stopwords |
static boolean |
isTitle(java.lang.String s,
Language lang)
Says whether this String is a title |
static boolean |
isUSState(java.lang.String s)
Returns TRUE for US States |
static boolean |
isUSStateAbbreviation(java.lang.String s)
Returns TRUE for US State abbreviations |
static java.lang.String |
languageForCode(java.lang.String s)
Returns the language for a code (or NULL) |
static void |
main(java.lang.String[] argv)
Test routine |
static java.lang.String |
mul(java.lang.String s)
Repeats the token with blanks one or more times |
static java.lang.String |
mulHyp(java.lang.String s)
Repeats the token with hyphens one or more times |
static java.lang.String |
nationForNationality(java.lang.String s)
Returns the nation for a nationality (or NULL) |
java.lang.String |
normalize()
Returns the letters and digits of the original name (eliminates punctuation) |
static NameML |
of(java.lang.String s,
Language lang)
Factory pattern |
static java.lang.String |
opt(java.lang.String s)
optional component |
static java.lang.String |
optMul(java.lang.String s)
optional multiple component |
static java.lang.String |
or(java.lang.String s1,
java.lang.String s2)
alternative |
java.lang.String |
original()
Returns the original name |
static java.util.List<java.lang.String> |
readTextFileLines(java.lang.String configFile,
java.lang.String encoding)
Read an entire text file as a list of strings. |
static java.util.Set<java.lang.String> |
readTextFileLinesSet(java.lang.String configFile)
Read set from an UTF-8 text file, ignoring lines starting with "##" |
java.lang.String |
toString()
Returns the original name |
static java.lang.String |
unabbreviateUSState(java.lang.String s)
Returns the US state for an abbreviation (or NULL) |
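The helper methods c, opt, optMul, mul, mulHyp and or are described above as builders of regular-expression fragments. Their bodies are not shown on this page, so the following is a guess at how such combinators might look, written as a separate class RegexCombinatorSketch. The exact strings returned by NameML may differ, and the reading of optMul as "zero or more blank-separated repetitions" is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical regex-fragment combinators modeled on the summaries above. */
final class RegexCombinatorSketch {
    /** Capturing group. */
    static String c(String s) { return "(" + s + ")"; }

    /** Optional component. */
    static String opt(String s) { return "(?:" + s + ")?"; }

    /** One or more repetitions of the token, separated by single blanks. */
    static String mul(String s) { return "(?:" + s + ")(?: (?:" + s + "))*"; }

    /** One or more repetitions of the token, separated by hyphens. */
    static String mulHyp(String s) { return "(?:" + s + ")(?:-(?:" + s + "))*"; }

    /** Assumed meaning: zero or more blank-separated repetitions. */
    static String optMul(String s) { return opt(mul(s)); }

    /** Alternative between two fragments. */
    static String or(String s1, String s2) { return "(?:" + s1 + "|" + s2 + ")"; }

    public static void main(String[] args) {
        String word = "[A-Z][a-z]+";
        // A hyphenated given name, optionally followed by a suffix.
        Pattern p = Pattern.compile(c(mulHyp(word)) + opt(" " + or("Jr\\.", "Sr\\.")));
        Matcher m = p.matcher("Jean-Paul Jr.");
        if (m.matches()) {
            System.out.println(m.group(1));
        }
    }
}
```

Wrapping each fragment in a non-capturing group keeps the quantifiers applied to the whole fragment, so only c introduces a numbered group.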
Methods inherited from class java.lang.Object |
---|
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Field Detail |
---|
public static final java.lang.String ANYNAME
public static java.lang.String roman
public static java.lang.String of
public static final java.lang.String U
public static final java.lang.String L
public static final java.lang.String A
public static final java.lang.String B
public static final java.lang.String BD
public static final java.lang.String BC
public static final java.lang.String DG
public static final java.lang.String H
public static final java.lang.String or
public static final java.lang.String familyNamePrefix
public static final java.util.regex.Pattern familyNamePrefixPattern
public static java.lang.String attributePrefix
public static java.util.regex.Pattern attributePrefixPattern
public static final java.lang.String familyNameSuffix
public static final java.util.regex.Pattern familyNameSuffixPattern
public static java.util.regex.Pattern titlePatternEn
public static java.util.regex.Pattern titlePatternDe
public static java.util.regex.Pattern titlePatternFr
public static java.util.regex.Pattern titlePatternEs
public static java.util.regex.Pattern titlePatternIt
public static final java.lang.String companyNameSuffix
public static final java.util.regex.Pattern companyNameSuffixPattern
public static final java.lang.String teamName
public static final java.util.regex.Pattern teamNamePattern
public static final java.lang.String prep
public static final java.lang.String laxName
public static final java.util.regex.Pattern laxNamePattern
public static final java.lang.String safeName
public static final java.util.regex.Pattern safeNamePattern
public static final java.util.regex.Pattern safeNamesPattern
public static final java.util.regex.Pattern safeNamesPatternNoPrep
public static final java.util.regex.Pattern laxAbbreviationPattern
public static final java.util.regex.Pattern safeAbbreviationPattern
public static final java.util.regex.Pattern laxCompanyPattern
public static final java.util.regex.Pattern safeCompanyPattern
public static final java.lang.String directFamilyNamePrefix
public static final java.lang.String personNameComponent
public static final java.lang.String givenNameComponent
public static final java.lang.String givenName
public static final java.lang.String givenNames
public static final java.lang.String familyName
public static final java.lang.String nickName
public static java.util.regex.Pattern laxPersonNamePatternEn
public static java.util.regex.Pattern laxPersonNamePatternDe
public static java.util.regex.Pattern laxPersonNamePatternEs
public static java.util.regex.Pattern laxPersonNamePatternFr
public static java.util.regex.Pattern laxPersonNamePatternIt
public static java.util.regex.Pattern safePersonNamePatternEn
public static java.util.regex.Pattern safePersonNamePatternDe
public static java.util.regex.Pattern safePersonNamePatternEs
public static java.util.regex.Pattern safePersonNamePatternFr
public static java.util.regex.Pattern safePersonNamePatternIt
public static java.util.Map<java.lang.String,java.lang.String> usStates
public static java.util.Map<java.lang.String,java.lang.String> languageCodes
public static java.util.Map<java.lang.String,java.lang.String> nationality2country
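The three Map<String,String> fields back simple lookups such as isUSStateAbbreviation and unabbreviateUSState. The sketch below shows that kind of lookup with a two-entry map. The class StateLookupSketch and its entries are illustrative, and the key direction (abbreviation to full name) is an assumption, since this page does not document how the maps are keyed.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical map-backed lookup in the style of the usStates field. */
final class StateLookupSketch {
    // Assumed direction: abbreviation -> full name; two entries for illustration only.
    private static final Map<String, String> US_STATES = new HashMap<>();
    static {
        US_STATES.put("CA", "California");
        US_STATES.put("NY", "New York");
    }

    /** True if the string is a known abbreviation. */
    static boolean isUSStateAbbreviation(String s) {
        return US_STATES.containsKey(s);
    }

    /** True if the string is a known full state name. */
    static boolean isUSState(String s) {
        return US_STATES.containsValue(s);
    }

    /** Returns the full name for an abbreviation, or null if it is unknown. */
    static String unabbreviateUSState(String s) {
        return US_STATES.get(s);
    }
}
```

Returning null for unknown keys matches the "(or NULL)" wording of the method summaries, so callers need a null check before using the result.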
Method Detail |
---|
public static final void init(NonsharedParameters params)
public static final void init(java.lang.String configPath)
Parameters: configPath - The path that contains all word lists.
public static final void init()
public static java.lang.String mul(java.lang.String s)
public static java.lang.String mulHyp(java.lang.String s)
public static java.lang.String opt(java.lang.String s)
public static java.lang.String optMul(java.lang.String s)
public static java.lang.String or(java.lang.String s1, java.lang.String s2)
public static java.lang.String c(java.lang.String s)
public static boolean isFamilyNamePrefix(java.lang.String s)
public static boolean isAttributePrefix(java.lang.String s)
public static boolean isPersonNameSuffix(java.lang.String s)
public static boolean isTitle(java.lang.String s, Language lang)
public static boolean isCompanyNameSuffix(java.lang.String s)
public static boolean isName(java.lang.String s)
public static boolean isNames(java.lang.String s)
public static boolean couldBeName(java.lang.String s)
public static boolean isStopWord(java.lang.String w, Language l)
public java.lang.String toString()
Overrides: toString in class java.lang.Object
public java.lang.String normalize()
public java.lang.String describe()
public java.lang.String original()
public static java.util.List<java.lang.String> readTextFileLines(java.lang.String configFile, java.lang.String encoding) throws java.io.IOException
Parameters: configFile - text file to open
encoding - character encoding name (as used by Java) or null for UTF-8
Throws: java.io.IOException
public static java.io.InputStream getConfigFileStream(java.lang.String configfile) throws java.io.FileNotFoundException
Throws: java.io.FileNotFoundException
public static java.util.Set<java.lang.String> readTextFileLinesSet(java.lang.String configFile)
Parameters: configFile - the UTF-8 text file to read
public static boolean isAbbreviation(java.lang.String word)
public static boolean couldBeAbbreviation(java.lang.String word)
public static boolean isCompanyName(java.lang.String s)
public static boolean couldBeCompanyName(java.lang.String s)
public static boolean couldBePersonName(java.lang.String s, Language lang)
public static boolean isPersonName(java.lang.String m, Language lang)
public static boolean isUSState(java.lang.String s)
public static boolean isUSStateAbbreviation(java.lang.String s)
public static java.lang.String unabbreviateUSState(java.lang.String s)
public static boolean isLanguage(java.lang.String s)
public static boolean isLanguageCode(java.lang.String s)
public static java.lang.String languageForCode(java.lang.String s)
public static boolean isNation(java.lang.String s)
public static boolean isNationality(java.lang.String s)
public static java.lang.String nationForNationality(java.lang.String s)
public static NameML of(java.lang.String s, Language lang)
public static void main(java.lang.String[] argv) throws java.lang.Exception
Throws: java.lang.Exception
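readTextFileLinesSet is summarized as reading a set from a UTF-8 text file while ignoring lines that start with "##". A minimal sketch of that behavior follows. The class WordListReaderSketch and its method are hypothetical; it reads from any Reader so that it can be tried without a file, and it skips empty lines and wraps I/O errors in UncheckedIOException, both of which are assumptions not stated in the original documentation.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical word-list reader; not the javatools implementation. */
final class WordListReaderSketch {
    /** Collects all lines into a set, skipping lines that start with "##". */
    static Set<String> readLinesSet(Reader source) {
        Set<String> result = new LinkedHashSet<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith("##")) {
                    continue; // comment line
                }
                if (line.isEmpty()) {
                    continue; // assumption: blank lines carry no entry
                }
                result.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "## titles\nMr.\nDr.\nMr.\n";
        System.out.println(readLinesSet(new StringReader(text)));
    }
}
```

To read an actual UTF-8 file with this sketch, the Reader can be obtained from java.nio.file.Files.newBufferedReader(path, StandardCharsets.UTF_8).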