Text
rigour.text
dam_levenshtein(left, right, max_length=env.MAX_NAME_LENGTH, max_edits=None)
cached
Compute the Damerau-Levenshtein distance between two strings.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
left | str | A string. | required |
right | str | A string. | required |
Returns:
Type | Description |
---|---|
int | The edit distance, as an integer number of character changes. |
Source code in rigour/text/distance.py
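To illustrate what a Damerau-Levenshtein distance measures, here is a minimal pure-Python sketch of the restricted variant (optimal string alignment), in which swapping two adjacent characters counts as one edit. The name `osa_distance` is hypothetical, and this is not necessarily the variant or implementation that `dam_levenshtein` uses; it also ignores the `max_length` and `max_edits` parameters.

```python
def osa_distance(left: str, right: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(left) + 1, len(right) + 1
    # dist[i][j] holds the edits needed to turn left[:i] into right[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if left[i - 1] == right[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,  # deletion
                dist[i][j - 1] + 1,  # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            # An adjacent transposition counts as a single edit.
            if (
                i > 1
                and j > 1
                and left[i - 1] == right[j - 2]
                and left[i - 2] == right[j - 1]
            ):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


print(osa_distance("ab", "ba"))
print(osa_distance("kitten", "sitting"))
```

In this sketch a swapped pair such as "ab" and "ba" costs one edit, where a plain Levenshtein distance would charge two.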
is_levenshtein_plausible(left, right, max_edits=env.LEVENSHTEIN_MAX_EDITS, max_percent=env.LEVENSHTEIN_MAX_PERCENT, max_length=env.MAX_NAME_LENGTH)
A sanity check that post-filters name matching results against a budget of allowed Levenshtein edits. It cuts off results where the Jaro-Winkler or Metaphone comparison was too lenient.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
left | str | A string. | required |
right | str | A string. | required |
max_edits | Optional[int] | The maximum number of edits allowed. | LEVENSHTEIN_MAX_EDITS |
max_percent | float | The maximum percentage of edits allowed. | LEVENSHTEIN_MAX_PERCENT |
Returns:
Type | Description |
---|---|
bool | A boolean. |
Source code in rigour/text/distance.py
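One way to read the edit budget is sketched below: the allowance is a fraction of the shorter string, optionally capped by `max_edits`. The helper names `edit_budget` and `is_within_budget` are hypothetical, the rounding (`math.ceil`) is an assumption, and the edit distance is taken as already computed. The actual logic in `rigour/text/distance.py` may differ.

```python
import math
from typing import Optional


def edit_budget(
    left: str, right: str, max_edits: Optional[int], max_percent: float
) -> int:
    """Number of edits tolerated for this pair of strings (illustrative)."""
    # Fraction of the shorter string, rounded up (the rounding is an assumption).
    budget = math.ceil(min(len(left), len(right)) * max_percent)
    if max_edits is not None:
        # The absolute cap wins when it is the stricter limit.
        budget = min(budget, max_edits)
    return budget


def is_within_budget(
    distance: int,
    left: str,
    right: str,
    max_edits: Optional[int],
    max_percent: float,
) -> bool:
    """True if an already computed edit distance fits the budget."""
    return distance <= edit_budget(left, right, max_edits, max_percent)


# The thresholds below are made-up values, not the library defaults.
print(edit_budget("Vladimir Putin", "Vladimir Puttin", 4, 0.25))
print(is_within_budget(3, "abc", "xyz", 1, 0.5))
```

Under this reading, short strings get a small allowance, so a loose phonetic match between two three-letter names is rejected after very few edits.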
jaro_winkler(left, right, max_length=env.MAX_NAME_LENGTH)
cached
Compute the Jaro-Winkler similarity of two strings.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
left | str | A string. | required |
right | str | A string. | required |
Returns:
Type | Description |
---|---|
float | A float between 0.0 and 1.0. |
Source code in rigour/text/distance.py
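For readers unfamiliar with the metric, the following is a textbook sketch of Jaro-Winkler similarity in pure Python. The function names `jaro_similarity` and `jaro_winkler_similarity`, the prefix scale of 0.1 and the four-character prefix limit are the conventional choices and are assumptions here; the library function may use a different implementation and additionally applies `max_length`.

```python
def jaro_similarity(left: str, right: str) -> float:
    """Textbook Jaro similarity between two strings."""
    if left == right:
        return 1.0
    len_l, len_r = len(left), len(right)
    if len_l == 0 or len_r == 0:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len_l, len_r) // 2 - 1, 0)
    matched_l = [False] * len_l
    matched_r = [False] * len_r
    matches = 0
    for i, char in enumerate(left):
        for j in range(max(0, i - window), min(len_r, i + window + 1)):
            if not matched_r[j] and right[j] == char:
                matched_l[i] = matched_r[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len_l):
        if matched_l[i]:
            while not matched_r[k]:
                k += 1
            if left[i] != right[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (
        matches / len_l + matches / len_r + (matches - transpositions) / matches
    ) / 3


def jaro_winkler_similarity(
    left: str, right: str, prefix_scale: float = 0.1
) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    score = jaro_similarity(left, right)
    prefix = 0
    for a, b in zip(left[:4], right[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1.0 - score)


print(round(jaro_winkler_similarity("MARTHA", "MARHTA"), 4))
```

The prefix boost is why this metric is popular for names, and also why it can be too lenient on strings that merely start alike.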
levenshtein(left, right, max_length=env.MAX_NAME_LENGTH, max_edits=None)
cached
Compute the Levenshtein distance between two strings.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
left | str | A string. | required |
right | str | A string. | required |
Returns:
Type | Description |
---|---|
int | The edit distance, as an integer number of character changes. |
Source code in rigour/text/distance.py
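As a rough illustration, a plain Levenshtein distance can be computed with two rows of the dynamic-programming table. The sketch below uses the hypothetical name `levenshtein_distance`; its early exit for `max_edits` (returning `max_edits + 1` once no cell in a row is within budget) is an assumption about how such a cutoff could behave, not a description of the library.

```python
from typing import Optional


def levenshtein_distance(
    left: str, right: str, max_edits: Optional[int] = None
) -> int:
    """Plain Levenshtein distance using two rows of the DP table."""
    previous = list(range(len(right) + 1))
    for i, left_char in enumerate(left, start=1):
        current = [i]
        for j, right_char in enumerate(right, start=1):
            cost = 0 if left_char == right_char else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        # The final distance can never be lower than the smallest value in
        # any row, so stop once the whole row is over budget.
        if max_edits is not None and min(current) > max_edits:
            return max_edits + 1
        previous = current
    return previous[-1]


print(levenshtein_distance("kitten", "sitting"))
print(levenshtein_distance("ab", "ba"))
```

Without a transposition rule, the swapped pair "ab" and "ba" costs two edits here.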
levenshtein_similarity(left, right, max_edits=env.LEVENSHTEIN_MAX_EDITS, max_percent=env.LEVENSHTEIN_MAX_PERCENT, max_length=env.MAX_NAME_LENGTH)
Compute the Damerau-Levenshtein similarity of two strings. The similarity is derived from the edit distance, measured relative to the length of the longest string.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
left | str | A string. | required |
right | str | A string. | required |
max_edits | Optional[int] | The maximum number of edits allowed. | LEVENSHTEIN_MAX_EDITS |
max_percent | float | The maximum fraction of the shortest string that is allowed to be edited. | LEVENSHTEIN_MAX_PERCENT |
Returns:
Type | Description |
---|---|
float | A float between 0.0 and 1.0. |
Source code in rigour/text/distance.py
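The normalisation described above can be sketched as one minus the edit distance divided by the length of the longest string. The helper `similarity_from_distance` is hypothetical and takes an already computed distance. Returning 0.0 for two empty strings, and for a distance above an optional `budget`, are assumptions rather than documented behaviour.

```python
from typing import Optional


def similarity_from_distance(
    distance: int, left: str, right: str, budget: Optional[int] = None
) -> float:
    """Turn an edit distance into a similarity between 0.0 and 1.0."""
    longest = max(len(left), len(right))
    if longest == 0:
        # Assumption: two empty strings are treated as not comparable.
        return 0.0
    if budget is not None and distance > budget:
        # Assumption: pairs over the edit budget score zero.
        return 0.0
    return 1.0 - distance / longest


# "night" and "nigth" differ by one adjacent transposition.
print(similarity_from_distance(1, "night", "nigth"))
```

Dividing by the longest string keeps the result in range, because an edit distance never exceeds the length of the longer input.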
metaphone(token)
cached
remove_bracketed_text(text)
Remove any text in brackets. This is meant to handle names of companies which include the jurisdiction, like: Turtle Management (Seychelles) Ltd.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | A text including text in brackets. | required |
Returns:
Type | Description |
---|---|
str | Text where the bracketed parts have been replaced with whitespace. |
Source code in rigour/text/cleaning.py
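A regular expression is one plausible way to get this behaviour. The sketch below replaces each innermost bracketed span with a single space. The name `strip_bracketed` and the choice to handle both round and square brackets are assumptions, and the pattern in `rigour/text/cleaning.py` may differ.

```python
import re

# Matches a pair of round or square brackets with no nested brackets inside.
BRACKETED = re.compile(r"\([^()]*\)|\[[^\[\]]*\]")


def strip_bracketed(text: str) -> str:
    """Replace each bracketed span with one space (illustrative sketch)."""
    return BRACKETED.sub(" ", text)


cleaned = strip_bracketed("Turtle Management (Seychelles) Ltd.")
# Collapse the leftover run of spaces for display.
print(" ".join(cleaned.split()))
```

Because the span is swapped for whitespace rather than deleted, the neighbouring words never fuse together, but callers may want to collapse repeated spaces afterwards.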
remove_emoji(string)
Remove Unicode ranges used by emoticons, symbols, flags and other visual codepoints from a piece of text. The primary use case is removing emojis from the names of political office holders coming from Wikidata.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
string | str | Text that may include emoji and pictographs. | required |
Returns:
Type | Description |
---|---|
str | Text with those codepoints removed. |
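Range-based removal of this kind can be sketched with a character class over a few Unicode blocks. The ranges below are an illustrative and deliberately incomplete selection, and `strip_emoji` is a hypothetical name; the library's own list of ranges is not shown on this page and may be broader.

```python
import re

EMOJI = re.compile(
    "["
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flag pairs)
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport and related blocks
    "\u2600-\u27BF"  # miscellaneous symbols and dingbats
    "\uFE0F\u200D"  # variation selector-16 and zero-width joiner
    "]+"
)


def strip_emoji(text: str) -> str:
    """Delete codepoints in the ranges above (illustrative sketch)."""
    return EMOJI.sub("", text)


print(strip_emoji("Jane Doe \U0001F1FA\U0001F1F8 \U0001F600").strip())
```

Accented letters and other non-ASCII text fall outside these ranges, so a name like "Zoë Müller" passes through unchanged.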