Addresses

rigour.addresses

This module provides a set of tools for handling postal/geographic addresses. It includes functions for normalising addresses for comparison purposes, and for formatting addresses given in parts for display as a single string.

Postal address formatting

These helpers process real-world addresses: they compose an address from individual parts and clean up the result.

from rigour.addresses import format_address_line

address = {
    "road": "Bahnhofstr.",
    "house_number": "10",
    "postcode": "86150",
    "city": "Augsburg",
    "state": "Bayern",
    "country": "Germany",
}
address_text = format_address_line(address, country="DE")
Acknowledgements

The address formatting database contained in rigour/data/addresses/formats.yml is derived from worldwide.yml in the OpenCageData address-formatting repository. It is used to format addresses according to the customs of the country being encoded.

clean_address(full)

Remove common formatting errors from addresses.

Source code in rigour/addresses/cleaning.py
def clean_address(full: str) -> str:
    """Remove common formatting errors from addresses."""
    while True:
        full, count = REPL.subn(_sub_match, full)
        if count == 0:
            break
    return full.strip()
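
The loop above re-applies the substitution until a pass makes no replacements, since fixing one defect can expose another. A minimal sketch of that fixed-point pattern follows. The `REPL` pattern and `_sub_match` callback here are hypothetical stand-ins, because the real definitions in rigour/addresses/cleaning.py are not shown on this page.

```python
import re

# Hypothetical stand-in for rigour's REPL (an assumption, not the real
# pattern): matches runs of whitespace, doubled commas, and whitespace
# that precedes a comma.
REPL = re.compile(r"(?P<spaces>\s{2,})|(?P<commas>,\s*,)|(?P<pre>\s+,)")


def _sub_match(match: re.Match) -> str:
    """Hypothetical callback: choose a replacement for the matched defect."""
    if match.group("spaces"):
        return " "
    return ","


def clean_address_sketch(full: str) -> str:
    """Apply the substitution repeatedly until nothing is left to fix."""
    while True:
        full, count = REPL.subn(_sub_match, full)
        if count == 0:
            break
    return full.strip()


print(clean_address_sketch("Bahnhofstr. 10 , , 86150  Augsburg"))
# Bahnhofstr. 10, 86150 Augsburg
```

With this stand-in pattern, the first pass turns ` , ,` into `,,`, and only the second pass merges the doubled comma, which is why a single call to `subn` would not be enough.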

format_address(address, country=None)

Format the given address parts into a multi-line string that matches the conventions of the country of the given address.

Parameters:

Name Type Description Default
address Dict[str, Optional[str]]

The address parts to be combined. Common parts include:
    summary: A short description of the address.
    po_box: The PO box/mailbox number.
    street: The street or road name.
    house: The descriptive name of the house.
    house_number: The number of the house on the street.
    postal_code: The postal code or ZIP code.
    city: The city or town name.
    county: The county or district name.
    state: The state or province name.
    state_district: The state or province district name.
    state_code: The state or province code.
    country: The name of the country (words, not ISO code).
    country_code: A pre-normalized country code.

required
country Optional[str]

ISO code for the country of the address.

None

Returns:

Type Description
str

A multi-line string with the formatted address.

Source code in rigour/addresses/format.py
def format_address(
    address: Dict[str, Optional[str]], country: Optional[str] = None
) -> str:
    """Format the given address part into a multi-line string that matches the
    conventions of the country of the given address.

    Args:
        address: The address parts to be combined. Common parts include:
            summary: A short description of the address.
            po_box: The PO box/mailbox number.
            street: The street or road name.
            house: The descriptive name of the house.
            house_number: The number of the house on the street.
            postal_code: The postal code or ZIP code.
            city: The city or town name.
            county: The county or district name.
            state: The state or province name.
            state_district: The state or province district name.
            state_code: The state or province code.
            country: The name of the country (words, not ISO code).
            country_code: A pre-normalized country code.
        country: ISO code for the country of the address.

    Returns:
        A multi-line string with the formatted address.
    """
    text = _format(address, country=country)
    prev: Optional[str] = None
    while prev != text:
        prev = text
        text = text.replace("\n\n", "\n").replace("\n ", "\n").strip()
    return text
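
The clean-up loop at the end of `format_address` can be tried on its own. The sketch below copies that loop into a standalone helper, here called `collapse_lines` (a hypothetical name), and runs it on an assumed raw template output in which some address parts were empty.

```python
from typing import Optional


def collapse_lines(text: str) -> str:
    """Repeat the replacements until the text stops changing."""
    prev: Optional[str] = None
    while prev != text:
        prev = text
        # Drop empty lines and leading spaces on a line, then trim the ends.
        text = text.replace("\n\n", "\n").replace("\n ", "\n").strip()
    return text


# Assumed raw output: empty template slots left blank lines and a stray space.
raw = "Bahnhofstr. 10\n\n\n 86150 Augsburg\n\nGermany\n"
print(collapse_lines(raw))
# Bahnhofstr. 10
# 86150 Augsburg
# Germany
```

The loop is needed because `str.replace` does not rescan its own output: three consecutive newlines shrink to two on the first pass and to one on the second.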

format_address_line(address, country=None)

Format the given address parts into a single-line string that matches the conventions of the country of the given address.

Parameters:

Name Type Description Default
address Dict[str, Optional[str]]

The address parts to be combined. Common parts include:
    summary: A short description of the address.
    po_box: The PO box/mailbox number.
    street: The street or road name.
    house: The descriptive name of the house.
    house_number: The number of the house on the street.
    postal_code: The postal code or ZIP code.
    city: The city or town name.
    county: The county or district name.
    state: The state or province name.
    state_district: The state or province district name.
    state_code: The state or province code.
    country: The name of the country (words, not ISO code).
    country_code: A pre-normalized country code.

required
country Optional[str]

ISO code for the country of the address.

None

Returns:

Type Description
str

A single-line string with the formatted address.

Source code in rigour/addresses/format.py
def format_address_line(
    address: Dict[str, Optional[str]], country: Optional[str] = None
) -> str:
    """Format the given address part into a single-line string that matches the
    conventions of the country of the given address.

    Args:
        address: The address parts to be combined. Common parts include:
            summary: A short description of the address.
            po_box: The PO box/mailbox number.
            street: The street or road name.
            house: The descriptive name of the house.
            house_number: The number of the house on the street.
            postal_code: The postal code or ZIP code.
            city: The city or town name.
            county: The county or district name.
            state: The state or province name.
            state_district: The state or province district name.
            state_code: The state or province code.
            country: The name of the country (words, not ISO code).
            country_code: A pre-normalized country code.
        country: ISO code for the country of the address.

    Returns:
        A single-line string with the formatted address.
    """
    line = ", ".join(_format(address, country=country).split("\n"))
    return clean_address(line)
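
As the source shows, the single-line form is the multi-line template output with newlines turned into `", "`, followed by a pass through `clean_address`. The sketch below illustrates only the join step on assumed input strings, without calling rigour itself.

```python
# Assumed multi-line template output, standing in for what _format returns.
multi_line = "Bahnhofstr. 10\n86150 Augsburg\nGermany"

single_line = ", ".join(multi_line.split("\n"))
print(single_line)  # Bahnhofstr. 10, 86150 Augsburg, Germany

# An empty line in the template output leaves a dangling separator behind.
with_gap = ", ".join("Bahnhofstr. 10\n\nGermany".split("\n"))
print(with_gap)  # Bahnhofstr. 10, , Germany
```

The second case shows the kind of defect a plain join can produce, which is presumably why the result is handed to `clean_address` before being returned.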

normalize_address(address, latinize=False, min_length=4)

Normalize the given address string for comparison. The transformation is lossy: the result is meant for matching, not for display.

Parameters:

Name Type Description Default
address str

The address to be normalized.

required
latinize bool

Whether to convert non-Latin characters to their Latin equivalents.

False
min_length int

Minimum length of the normalized address. If the result is shorter, None is returned.

4

Returns:

Type Description
Optional[str]

The normalized address, or None if it is shorter than min_length.

Source code in rigour/addresses/normalize.py
def normalize_address(
    address: str, latinize: bool = False, min_length: int = 4
) -> Optional[str]:
    """Normalize the given address string for comparison, in a way that is destructive to
    the ability for displaying it (makes it ugly).

    Args:
        address: The address to be normalized.
        latinize: Whether to convert non-Latin characters to their Latin equivalents.
        min_length: Minimum length of the normalized address.

    Returns:
        The normalized address.
    """
    tokens: List[List[str]] = []
    token: List[str] = []
    for char in address.lower():
        if char in CHARS_ALLOWED:
            chr: Optional[str] = char
        else:
            cat = unicodedata.category(char)
            chr = TOKEN_SEP_CATEGORIES.get(cat, char)
        if chr is None:
            continue
        if chr == WS:
            if len(token):
                tokens.append(token)
            token = []
            continue
        token.append(chr)
    if len(token):
        tokens.append(token)

    parts: List[str] = []
    for token in tokens:
        token_str: Optional[str] = "".join(token)
        if latinize:
            token_str = ascii_text(token_str)
        if token_str is None:
            continue
        parts.append(token_str)
    norm_address = WS.join(parts)
    if len(norm_address) < min_length:
        return None
    return norm_address
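
The tokenisation step can be sketched with the standard library alone. The category table below is an assumption: the real `CHARS_ALLOWED` and `TOKEN_SEP_CATEGORIES` values are not shown on this page, and the `latinize` step is left out because `ascii_text` comes from outside the standard library.

```python
import unicodedata
from typing import Dict, List, Optional

WS = " "

# Hypothetical category table (an assumption, not rigour's actual data).
# A value of WS ends the current token; None drops the character.
SEPARATOR_CATEGORIES: Dict[str, Optional[str]] = {
    "Zs": WS,    # space separators
    "Po": WS,    # other punctuation such as "." and ","
    "Pd": WS,    # dashes
    "Cc": WS,    # control characters such as "\n"
    "Mn": None,  # combining marks
}


def normalize_sketch(address: str, min_length: int = 4) -> Optional[str]:
    """Lowercase, split on separator categories, and re-join with one space."""
    tokens: List[str] = []
    token: List[str] = []
    for char in address.lower():
        mapped = SEPARATOR_CATEGORIES.get(unicodedata.category(char), char)
        if mapped is None:
            continue
        if mapped == WS:
            if token:
                tokens.append("".join(token))
            token = []
            continue
        token.append(mapped)
    if token:
        tokens.append("".join(token))
    norm = WS.join(tokens)
    return norm if len(norm) >= min_length else None


print(normalize_sketch("Bahnhofstr. 10, 86150 Augsburg"))
# bahnhofstr 10 86150 augsburg
print(normalize_sketch("A-1"))
# None
```

The second call returns None because the normalized form `a 1` has only three characters, below the default `min_length` of 4.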

remove_address_keywords(address, latinize=False, replacement=WS)

Remove common address keywords (such as "street", "road", "south", etc.) from the given address string. The address string is assumed to have already been normalized using normalize_address.

The output may contain multiple consecutive whitespace characters, which are not collapsed.

Parameters:

Name Type Description Default
address str

The address to be cleaned.

required
latinize bool

Whether to convert non-Latin characters to their Latin equivalents.

False
replacement str

The string substituted for each removed keyword.

WS

Returns:

Type Description
Optional[str]

The address, with address keywords removed.

Source code in rigour/addresses/normalize.py
def remove_address_keywords(
    address: str, latinize: bool = False, replacement: str = WS
) -> Optional[str]:
    """Remove common address keywords (such as "street", "road", "south", etc.) from the
    given address string. The address string is assumed to have already been normalized
    using `normalize_address`.

    The output may contain multiple consecutive whitespace characters, which are not collapsed.

    Args:
        address: The address to be cleaned.
        latinize: Whether to convert non-Latin characters to their Latin equivalents.

    Returns:
        The address, without any stopwords.
    """
    replacer = _address_replacer(latinize=latinize)
    return replacer.remove(address, replacement=replacement)
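
The note about uncollapsed whitespace is easy to see with a small stand-in. The sketch below does not use rigour's replacer. It assumes a tiny hypothetical keyword list and a regular expression in its place, and shows how a caller might collapse the whitespace afterwards.

```python
import re

# Hypothetical keyword list; rigour ships its own, much larger data set.
KEYWORDS = ["street", "road", "south"]
PATTERN = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b")


def remove_keywords_sketch(address: str, replacement: str = " ") -> str:
    """Replace every whole-word keyword with the replacement string."""
    return PATTERN.sub(lambda match: replacement, address)


removed = remove_keywords_sketch("12 south main street")
print(repr(removed))  # '12   main  '

# Collapsing the leftover whitespace is left to the caller.
collapsed = " ".join(removed.split())
print(repr(collapsed))  # '12 main'
```

Each removed keyword leaves its replacement between the spaces that already surrounded it, so runs of spaces build up unless the caller collapses them.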

shorten_address_keywords(address, latinize=False)

Shorten common address keywords (such as "street", "road", "south", etc.) in the given address string. The address string is assumed to have already been normalized using normalize_address.

Parameters:

Name Type Description Default
address str

The address to be cleaned.

required
latinize bool

Whether to convert non-Latin characters to their Latin equivalents.

False

Returns:

Type Description
Optional[str]

The address, with keywords shortened.

Source code in rigour/addresses/normalize.py
def shorten_address_keywords(address: str, latinize: bool = False) -> Optional[str]:
    """Shorten common address keywords (such as "street", "road", "south", etc.) in the
    given address string. The address string is assumed to have already been normalized
    using `normalize_address`.

    Args:
        address: The address to be cleaned.
        latinize: Whether to convert non-Latin characters to their Latin equivalents.

    Returns:
        The address, with keywords shortened.
    """
    replacer = _address_replacer(latinize=latinize)
    return replacer(address)
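
Shortening keywords helps comparison because a spelled-out address and its abbreviated variant end up as the same string. The sketch below illustrates the idea with a hypothetical abbreviation table and a regular expression. It is not rigour's replacer, whose data and matching rules are not shown on this page.

```python
import re

# Hypothetical abbreviation table (an assumption, not rigour's data).
ABBREVIATIONS = {"street": "st", "road": "rd", "south": "s"}
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in ABBREVIATIONS) + r")\b"
)


def shorten_keywords_sketch(address: str) -> str:
    """Replace every whole-word keyword with its short form."""
    return PATTERN.sub(lambda match: ABBREVIATIONS[match.group(0)], address)


long_form = shorten_keywords_sketch("12 south main street")
short_form = shorten_keywords_sketch("12 s main st")
print(long_form)                 # 12 s main st
print(long_form == short_form)   # True
```

The inputs here are already lowercase and free of punctuation, in line with the assumption that `normalize_address` has been applied first.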