openclean.function.value.normalize.text module

Collection of functions to normalize text values.

openclean.function.value.normalize.text.NONDIACRITICS = {'©': 'c', 'ß': 'ss', 'æ': 'ae', 'ð': 'd', 'ø': 'oe', 'þ': 'th', 'đ': 'd', 'ħ': 'h', 'ı': 'i', 'ĸ': 'k', 'ł': 'l', 'ŋ': 'n', 'œ': 'oe', 'ŧ': 't', 'ſ': 's', 'ƿ': 'w', 'ɖ': 'd'}

First characters of the Unicode character categories that are removed. Currently, control characters ('C') and punctuation ('P') are removed.

class openclean.function.value.normalize.text.TextNormalizer(preproc: Optional[Callable] = None)

Bases: openclean.function.value.base.PreparedFunction

Text normalizer that replaces non-diacritic characters, umlauts, accents, etc. with their equivalent ASCII character(s).

eval(value: Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]) str

Normalize a given value. Converts the value to a string if it is not already of type string. Then replaces all non-diacritic characters with their equivalents as defined in NONDIACRITICS. The last step uses the unicodedata normalize function, followed by ASCII encoding, to reduce umlauts, accents, etc. to their base characters.

Parameters

value (scalar or tuple) – Value from the list that was used to prepare the function.

Return type

string
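The steps described for eval can be sketched with the standard library alone. This is a minimal illustration and not the openclean implementation: the function name normalize_text is hypothetical, the choice of the NFKD normalization form and of silently dropping non-ASCII remainders are assumptions, and the optional preproc step is left out.

```python
import unicodedata

# Copied from the NONDIACRITICS constant documented above.
NONDIACRITICS = {
    '©': 'c', 'ß': 'ss', 'æ': 'ae', 'ð': 'd', 'ø': 'oe', 'þ': 'th',
    'đ': 'd', 'ħ': 'h', 'ı': 'i', 'ĸ': 'k', 'ł': 'l', 'ŋ': 'n',
    'œ': 'oe', 'ŧ': 't', 'ſ': 's', 'ƿ': 'w', 'ɖ': 'd',
}


def normalize_text(value):
    """Hypothetical stand-in for TextNormalizer.eval (no preproc step)."""
    # Step 1: make sure we are working with a string.
    text = value if isinstance(value, str) else str(value)
    # Step 2: replace characters that have no Unicode decomposition
    # with their ASCII equivalents from the mapping.
    text = ''.join(NONDIACRITICS.get(c, c) for c in text)
    # Step 3: decompose accented characters into base character plus
    # combining marks (NFKD is an assumption), then drop everything
    # that cannot be encoded as ASCII.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(normalize_text('Straße'))      # Strasse
print(normalize_text('smørrebrød'))  # smoerrebroed
print(normalize_text('café'))        # cafe
```

The mapping step has to run before the decomposition step in this sketch: characters such as 'ß' or 'ø' have no decomposition into a base character, so the ASCII encoding would otherwise discard them.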

openclean.function.value.normalize.text.default_preproc(value: Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]) str

Default pre-processing for string normalization. Ensures that the given argument is a string. Removes leading and trailing whitespace, converts characters to lower case, and replaces each run of consecutive whitespace characters with a single blank space character.

Parameters

value (scalar or tuple) – Input value that is being prepared for normalization.

Return type

string
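The behavior described for default_preproc can be approximated as follows. This is a sketch under the assumption that non-string values are converted with str(); the name preproc_sketch is hypothetical and the actual implementation may differ.

```python
def preproc_sketch(value):
    """Hypothetical approximation of default_preproc."""
    # Ensure that the argument is a string.
    text = value if isinstance(value, str) else str(value)
    # str.split() without arguments discards leading and trailing
    # whitespace and splits on runs of whitespace, so joining the
    # parts with a single blank collapses consecutive whitespace.
    return ' '.join(text.lower().split())


print(preproc_sketch('  Hello\t\n  World  '))  # hello world
```

A callable of this shape (one value in, one string out) is the kind of function that could be passed as the preproc argument of TextNormalizer.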