Names
rigour.names
Name handling utilities for person and organization names. This module contains a large (and growing) set of tools for handling names. In general, there are three types of names: people, organizations, and objects. Different normalization may be required for each of these types, including prefix removal for person names (e.g. "Mr." or "Ms.") and type normalization for organization names (e.g. "Incorporated" -> "Inc" or "Limited" -> "Ltd").
The `Name` class is meant to provide a structure for a name, including its original form, normalized form, metadata on the type of thing described by the name, and the language of the name. The `NamePart` class is used to represent individual parts of a name, such as the first name, middle name, and last name.
Name
Bases: object
A name of a thing, such as a person, organization or object. Each name consists of a sequence of parts, each of which has a form and a tag. The form is the text of the part, and the tag is a label indicating the type of part. For example, in the name "John Smith", "John" is a given name and "Smith" is a family name. The tag for "John" would be `NamePartTag.GIVEN` and the tag for "Smith" would be `NamePartTag.FAMILY`. The form for both parts would be the text of the part itself.
Source code in rigour/names/name.py
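To make the form/tag relationship concrete, here is a minimal, self-contained sketch of the "John Smith" example. `SketchTag` and `SketchPart` are hypothetical stand-ins written for this illustration; they are not the library's `NamePartTag` or `NamePart` classes, whose constructors and attributes are not shown on this page.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class SketchTag(Enum):
    """Hypothetical stand-in for NamePartTag; only two members are modelled."""

    GIVEN = "GIVEN"
    FAMILY = "FAMILY"


@dataclass(frozen=True)
class SketchPart:
    """Hypothetical stand-in for a name part: the text (form) plus its tag."""

    form: str
    tag: SketchTag


# "John Smith" as a sequence of tagged parts.
parts: List[SketchPart] = [
    SketchPart(form="John", tag=SketchTag.GIVEN),
    SketchPart(form="Smith", tag=SketchTag.FAMILY),
]

for part in parts:
    print(part.form, part.tag.name)
```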
NamePart
Bases: object
A part of a name, such as a given name or family name. This object is used to compare and match names. It generates and caches representations of the name in various processing forms.
Source code in rigour/names/part.py
NamePartTag
Bases: Enum
Within a name, identify name part types.
Source code in rigour/names/tag.py
NameTypeTag
Bases: Enum
Metadata on what sort of object is described by a name
Source code in rigour/names/tag.py
extract_org_types(name, normalizer=_normalize_compare)
Match any organization type designation (e.g. LLC, Inc, GmbH) in the given entity name and return the extracted types.
This can be used as a crude heuristic to determine whether a given string is a company name.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`name` | `str` | The text to be processed. It is assumed to be already normalized (see below). | required |
`normalizer` | `Callable[[str \| None], str \| None]` | A text normalization function to run on the lookup values before matching to remove text anomalies and make matches more likely. | `_normalize_compare` |
Returns:
Type | Description |
---|---|
`List[Tuple[str, str]]` | A list of tuples, each holding the org type as matched and the compare form of it. |
Source code in rigour/names/org_types.py
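The matching idea described for `extract_org_types` can be sketched without the library. Everything below is an assumption made for illustration: the tiny `ORG_TYPES` table, the `sketch_normalize` function, and the word-boundary regex are hypothetical and do not reproduce the library's data or matching logic. The sketch only shows the pattern of normalizing the lookup values and returning `(matched text, compare form)` pairs.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical lookup table: org type as written -> compare form.
ORG_TYPES: Dict[str, str] = {"GmbH": "llc", "LLC": "llc", "Inc": "inc", "Ltd": "ltd"}


def sketch_normalize(text: Optional[str]) -> Optional[str]:
    """Hypothetical normalizer: casefold and collapse whitespace."""
    if text is None:
        return None
    text = " ".join(text.casefold().split())
    return text or None


def sketch_extract_org_types(
    name: str,
    normalizer: Callable[[Optional[str]], Optional[str]] = sketch_normalize,
) -> List[Tuple[str, str]]:
    """Return (matched text, compare form) for each org type found in name."""
    matches: List[Tuple[str, str]] = []
    for written, compare in ORG_TYPES.items():
        needle = normalizer(written)  # the normalizer runs on the lookup values
        if needle is None:
            continue
        # Only match whole tokens, so "inc" does not match inside "incredible".
        found = re.search(r"(?<!\w)" + re.escape(needle) + r"(?!\w)", name)
        if found is not None:
            matches.append((found.group(0), compare))
    return matches


# The input is assumed to be normalized already, as the parameter table states.
print(sketch_extract_org_types("siemens gmbh"))
print(sketch_extract_org_types("incredible things"))
```

An empty result list is what makes the function usable as a rough "is this a company name?" check.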
is_name(name)
Check if the given string is a name. The string is considered a name if it contains at least one character that is a letter (category 'L' in Unicode).
Source code in rigour/names/check.py
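The rule described for `is_name` can be expressed with the standard library's `unicodedata` module. This is a sketch of the stated rule only; `sketch_is_name` is a hypothetical name and the library's actual implementation may differ in detail.

```python
import unicodedata


def sketch_is_name(name: str) -> bool:
    """True if at least one character has a Unicode category starting with 'L'."""
    return any(unicodedata.category(ch).startswith("L") for ch in name)


print(sketch_is_name("John"))     # contains Latin letters
print(sketch_is_name("1234 --"))  # digits, space and punctuation only
```

Letter categories cover all scripts (`Lu`, `Ll`, `Lt`, `Lm`, `Lo`), so a string written purely in, say, Chinese characters also counts as a name under this rule.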
load_person_names()
Load the person QID to name mappings from disk. This is a collection of aliases (in various alphabets) of person name parts mapped to a Wikidata QID representing that name part.
Returns:
Type | Description |
---|---|
`Generator[Tuple[str, List[str]], None, None]` | A generator yielding tuples of a QID and its list of names. |
Source code in rigour/names/person.py
load_person_names_mapping(normalizer=noop_normalizer)
Load the person QID to name mappings from disk. This is a collection of aliases (in various alphabets) of person name parts mapped to a Wikidata QID representing that name part.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`normalizer` | `Normalizer` | A function to normalize names. Defaults to `noop_normalizer`. | `noop_normalizer` |
Returns:
Type | Description |
---|---|
`Dict[str, Set[str]]` | A dictionary mapping normalized names to sets of QIDs. |
Source code in rigour/names/person.py
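The relationship between the two loaders (QID with a list of names on one side, normalized name with a set of QIDs on the other) amounts to inverting the records. The sketch below shows that inversion on made-up placeholder records. `sketch_build_mapping`, the sample identifiers, and the assumption that a normalizer may return `None` to drop a name are all hypothetical and not taken from the library.

```python
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple

# Placeholder records shaped like (QID, [names]); the identifiers are invented.
SAMPLE_RECORDS: List[Tuple[str, List[str]]] = [
    ("QX1", ["John", "Иван"]),
    ("QX2", ["john", "Johannes"]),
]


def sketch_build_mapping(
    records: Iterable[Tuple[str, List[str]]],
    normalizer: Callable[[str], Optional[str]] = lambda text: text,
) -> Dict[str, Set[str]]:
    """Invert (QID, names) records into a normalized-name -> QIDs mapping."""
    mapping: Dict[str, Set[str]] = {}
    for qid, names in records:
        for name in names:
            norm = normalizer(name)
            if norm is None:
                # Assumption: a normalizer may reject a name entirely.
                continue
            mapping.setdefault(norm, set()).add(qid)
    return mapping


as_is = sketch_build_mapping(SAMPLE_RECORDS)
folded = sketch_build_mapping(SAMPLE_RECORDS, normalizer=str.casefold)
print(sorted(as_is))
print(sorted(folded["john"]))
```

With the no-op default, "John" and "john" stay separate keys; a case-folding normalizer merges them so that one key points at both QIDs.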
pick_case(names)
Pick the best mix of lower- and uppercase characters from a set of names that are identical except for case.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`names` | `List[str]` | A list of identical names in different cases. | required |
Returns:
Type | Description |
---|---|
`Optional[str]` | The best name for display. |
Source code in rigour/names/pick.py
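As an illustration of what "best mix of lower- and uppercase" could mean, the sketch below scores each candidate by how far it deviates from a capitalized-word pattern and keeps the lowest score. This heuristic is an assumption chosen for the example; it is not the library's algorithm, and `sketch_pick_case` is a hypothetical name.

```python
from typing import List, Optional


def _case_penalty(name: str) -> int:
    """Count lowercase word-initial characters and uppercase characters elsewhere."""
    penalty = 0
    for word in name.split():
        if word[0].islower():
            penalty += 1
        penalty += sum(1 for ch in word[1:] if ch.isupper())
    return penalty


def sketch_pick_case(names: List[str]) -> Optional[str]:
    """Pick the candidate with the lowest penalty; None for an empty list."""
    if not names:
        return None
    # The name itself is the tie-breaker, which keeps the result deterministic.
    return min(names, key=lambda name: (_case_penalty(name), name))


print(sketch_pick_case(["JOHN SMITH", "john smith", "John Smith"]))
```

A scoring rule this simple penalizes legitimate inner capitals (for example "McDonald"), which is one reason a production implementation would need to be more careful.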
pick_name(names)
Pick the best name from a list of names. This is meant to pick a centroid name (the one most similar to all the others), with a bias towards names in a Latin script.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`names` | `List[str]` | A list of names. | required |
Returns:
Type | Description |
---|---|
`Optional[str]` | The best name for display. |
Source code in rigour/names/pick.py
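One way to picture "centroid name with a Latin bias" is to score every candidate by its summed similarity to all candidates and then boost candidates written mostly in Latin letters. The sketch below does this with `difflib.SequenceMatcher` from the standard library. The similarity measure, the weighting formula, and the name `sketch_pick_name` are assumptions for illustration, not the library's method.

```python
import unicodedata
from difflib import SequenceMatcher
from typing import List, Optional


def _latin_share(name: str) -> float:
    """Fraction of alphabetic characters whose Unicode name mentions LATIN."""
    letters = [ch for ch in name if ch.isalpha()]
    if not letters:
        return 0.0
    latin = sum(1 for ch in letters if "LATIN" in unicodedata.name(ch, ""))
    return latin / len(letters)


def sketch_pick_name(names: List[str]) -> Optional[str]:
    """Pick the candidate most similar to all others, favouring Latin script."""
    candidates = [name.strip() for name in names if name.strip()]
    if not candidates:
        return None

    def score(candidate: str) -> float:
        similarity = sum(
            SequenceMatcher(None, candidate.casefold(), other.casefold()).ratio()
            for other in candidates
        )
        # Hypothetical weighting: a fully Latin name doubles its similarity score.
        return similarity * (1.0 + _latin_share(candidate))

    return max(candidates, key=score)


print(
    sketch_pick_name(
        ["Vladimir Putin", "Vladimir Putin", "Владимир Путин", "Vladimir V. Putin"]
    )
)
```

Repeated spellings count once per occurrence here, so the most frequently reported form tends to win.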
remove_org_types(name, replacement='', normalizer=_normalize_compare)
Match any organization type designation (e.g. LLC, Inc, GmbH) in the given entity name and replace it with the given fixed string (empty by default, which signals removal).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`name` | `str` | The text to be processed. It is assumed to be already normalized (see below). | required |
`normalizer` | `Callable[[str \| None], str \| None]` | A text normalization function to run on the lookup values before matching to remove text anomalies and make matches more likely. | `_normalize_compare` |
Returns:
Type | Description |
---|---|
`str` | The text with organization types replaced/removed. |
Source code in rigour/names/org_types.py
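The replace-or-remove behaviour can be sketched with a single regular expression. The short `ORG_TYPE_FORMS` tuple, the word-boundary matching, and the whitespace clean-up below are assumptions for illustration; `sketch_remove_org_types` is a hypothetical function and does not reflect how the library stores or matches its org types.

```python
import re

# Hypothetical, already-normalized org type forms.
ORG_TYPE_FORMS = ("llc", "inc", "gmbh", "ltd")

_ORG_TYPE_RE = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(form) for form in ORG_TYPE_FORMS) + r")(?!\w)"
)


def sketch_remove_org_types(name: str, replacement: str = "") -> str:
    """Replace each whole-token org type with replacement (empty means removal)."""
    replaced = _ORG_TYPE_RE.sub(replacement, name)
    # Assumption of this sketch: collapse the whitespace left behind by a removal.
    return " ".join(replaced.split())


print(sketch_remove_org_types("siemens gmbh"))
print(sketch_remove_org_types("acme inc", replacement="CORP"))
```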
remove_person_prefixes(name)
replace_org_types_compare(name, normalizer=_normalize_compare)
Replace any organization type indicated in the given entity name (often as a prefix or suffix) with a heavily normalized form label. This will rewrite country-specific entity types (e.g. GmbH) into a globally normalized set of types (e.g. LLC). The resulting text is meant to be used in comparison processes and is no longer fit for presentation to a user.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`name` | `str` | The text to be processed. It is assumed to be already normalized (see below). | required |
`normalizer` | `Callable[[str \| None], str \| None]` | A text normalization function to run on the lookup values before matching to remove text anomalies and make matches more likely. | `_normalize_compare` |
Returns:
Type | Description |
---|---|
`str` | The text with organization types replaced. |
Source code in rigour/names/org_types.py
replace_org_types_display(name, normalizer=normalize_display)
Replace organization types in the text with their shortened form. This will perform a display-safe (light) form of normalization, useful for shortening spelt-out legal forms into common abbreviations (e.g. Siemens Aktiengesellschaft -> Siemens AG).
If the result of the replacement yields an empty string, the original text is returned as-is.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`name` | `str` | The text to be processed. It is assumed to be already normalized (see below). | required |
`normalizer` | `Callable[[str \| None], str \| None]` | A text normalization function to run on the lookup values before matching to remove text anomalies and make matches more likely. | `normalize_display` |
Returns:
Type | Description |
---|---|
`str` | The text with organization types replaced. |
Source code in rigour/names/org_types.py
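The display-safe shortening, including the fall-back to the original text when the result would be empty, can be sketched as follows. The three-entry `DISPLAY_FORMS` table uses only the examples mentioned on this page, and the case-insensitive whole-word matching is an assumption; `sketch_replace_display` is a hypothetical name, not the library function.

```python
import re
from typing import Dict

# Hypothetical table: spelt-out legal form -> common abbreviation.
DISPLAY_FORMS: Dict[str, str] = {
    "Aktiengesellschaft": "AG",
    "Incorporated": "Inc",
    "Limited": "Ltd",
}


def sketch_replace_display(name: str) -> str:
    """Shorten spelt-out legal forms; return the input unchanged if nothing is left."""
    result = name
    for long_form, short_form in DISPLAY_FORMS.items():
        pattern = re.compile(
            r"(?<!\w)" + re.escape(long_form) + r"(?!\w)", re.IGNORECASE
        )
        result = pattern.sub(short_form, result)
    result = result.strip()
    # Mirror the documented behaviour: an empty result yields the original text.
    return result if result else name


print(sketch_replace_display("Siemens Aktiengesellschaft"))
print(sketch_replace_display("Acme Limited"))
```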
tokenize_name(text, token_min_length=1)
Split a person or entity's name into name parts.
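A rough picture of name tokenization with a minimum token length: the sketch below treats letters, digits and combining marks as token characters and everything else as a separator. Which characters the library actually treats as separators (hyphens, apostrophes, and so on) is not stated on this page, so that rule is an assumption, and `sketch_tokenize_name` is a hypothetical name.

```python
import unicodedata
from typing import List


def sketch_tokenize_name(text: str, token_min_length: int = 1) -> List[str]:
    """Split text into tokens of letters, numbers and marks; drop short tokens."""
    tokens: List[str] = []
    current: List[str] = []
    for ch in text:
        # Categories L* (letters), N* (numbers) and M* (marks) stay in a token.
        if unicodedata.category(ch)[0] in ("L", "N", "M"):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return [token for token in tokens if len(token) >= token_min_length]


print(sketch_tokenize_name("Jean-Claude Van Damme"))
print(sketch_tokenize_name("J. K. Rowling", token_min_length=2))
```

Raising `token_min_length` is a simple way to drop initials before comparing names.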