openclean.function.value.normalize.text module

Collection of functions to normalize text values.

openclean.function.value.normalize.text.NONDIACRITICS = {'©': 'c', 'ß': 'ss', 'æ': 'ae', 'ð': 'd', 'ø': 'oe', 'þ': 'th', 'đ': 'd', 'ħ': 'h', 'ı': 'i', 'ĸ': 'k', 'ł': 'l', 'ŋ': 'n', 'œ': 'oe', 'ŧ': 't', 'ſ': 's', 'ƿ': 'w', 'ɖ': 'd'}: Lookup table that maps non-diacritic characters to their equivalent ASCII character(s). Used by TextNormalizer.eval to replace these characters before Unicode normalization is applied.

class openclean.function.value.normalize.text.TextNormalizer(preproc: Optional[Callable] = None)

Bases: openclean.function.value.base.PreparedFunction

Text normalizer that replaces non-diacritic characters, umlauts, accents, etc. with their equivalent ASCII character(s).

eval(value: Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]) → str

Normalize a given value. Converts the value to a string if it is not already of type string. Then replaces all non-diacritic characters with their equivalents as defined in NONDIACRITICS. The last step uses the unicodedata normalize function and the string encode function to reduce umlauts, accents, etc. to their base characters.

Parameters: value (scalar or tuple) – Value from the list that was used to prepare the function.
Return type: string
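
The steps described for eval can be sketched in plain Python as follows. This is a minimal illustration, not the library's implementation: the names normalize_sketch and SAMPLE_NONDIACRITICS are hypothetical, the table is an abbreviated copy of NONDIACRITICS, and the NFKD normalization form and the 'ignore' error handler are assumptions, since the description above does not name them.

```python
import unicodedata

# Abbreviated, illustrative copy of the NONDIACRITICS lookup table.
SAMPLE_NONDIACRITICS = {'ß': 'ss', 'æ': 'ae', 'ø': 'oe', 'ł': 'l'}


def normalize_sketch(value) -> str:
    """Hypothetical stand-in for the steps described for eval."""
    # Step 1: convert the value to a string if it is not one already.
    text = value if isinstance(value, str) else str(value)
    # Step 2: replace non-diacritic characters via the lookup table.
    text = ''.join(SAMPLE_NONDIACRITICS.get(c, c) for c in text)
    # Step 3 (assumed form): decompose accented characters into a base
    # character plus combining marks, then drop everything non-ASCII.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')
```

Under these assumptions, normalize_sketch('Straße') yields 'Strasse' and normalize_sketch('café') yields 'cafe'. Characters that have neither a table entry nor a decomposition are dropped by the 'ignore' handler, which is why the lookup step has to run before the encode step.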

openclean.function.value.normalize.text.default_preproc(value: Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]) → str

Default pre-processing for string normalization. Ensures that the given argument is a string. Removes leading and trailing whitespace, converts characters to lower case, and replaces each run of consecutive whitespace characters with a single blank space character.

Parameters: value (scalar or tuple) – Input value that is being prepared for normalization.
Return type: string
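
The pre-processing steps described for default_preproc can be approximated with built-in string methods. The function name preproc_sketch is hypothetical, and converting non-string input with str() is an assumption about how "ensures that the given argument is a string" is realized; the library's own code may differ.

```python
def preproc_sketch(value) -> str:
    """Hypothetical stand-in for the steps described for default_preproc."""
    # Assumption: non-string values are converted with str().
    text = value if isinstance(value, str) else str(value)
    # str.split() without arguments splits on runs of whitespace and
    # discards leading and trailing whitespace, so joining the parts
    # with a single blank collapses every run in one pass.
    return ' '.join(text.lower().split())
```

For example, preproc_sketch('  Hello \t\n World  ') returns 'hello world'.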