openclean.function.value.normalize.text module
Collection of functions to normalize text values.
- openclean.function.value.normalize.text.NONDIACRITICS = {'©': 'c', 'ß': 'ss', 'æ': 'ae', 'ð': 'd', 'ø': 'oe', 'þ': 'th', 'đ': 'd', 'ħ': 'h', 'ı': 'i', 'ĸ': 'k', 'ł': 'l', 'ŋ': 'n', 'œ': 'oe', 'ŧ': 't', 'ſ': 's', 'ƿ': 'w', 'ɖ': 'd'}
First characters of Unicode character categories that are removed. Currently we remove control characters ‘C’ and punctuation ‘P’.
- class openclean.function.value.normalize.text.TextNormalizer(preproc: Optional[Callable] = None)
Bases:
openclean.function.value.base.PreparedFunction
Text normalizer that replaces non-diacritic characters, umlauts, accents, etc. with their equivalent ASCII character(s).
- eval(value: Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]) str
Normalize a given value. Converts the value to a string if it is not already of type string. Then replaces all non-diacritic characters with their equivalents as defined in NONDIACRITICS. The last step uses the unicodedata normalize function and string encoding to reduce umlauts, accents, etc. to their base characters.
- Parameters
value (scalar or tuple) – Value from the list that was used to prepare the function.
- Return type
string
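The steps described for eval can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the openclean implementation: the function name normalize_text is hypothetical, the mapping is a subset of NONDIACRITICS, and the choice of the NFKD normalization form and of ignoring encoding errors are assumptions. The optional preproc step is left out.

```python
import unicodedata

# Subset of the NONDIACRITICS mapping listed above, copied for illustration.
NONDIACRITICS = {'ß': 'ss', 'æ': 'ae', 'ø': 'oe', 'ł': 'l', 'œ': 'oe'}


def normalize_text(value):
    """Hypothetical stand-in for TextNormalizer.eval (not the library code)."""
    # Step 1: convert non-string values to their string representation.
    text = value if isinstance(value, str) else str(value)
    # Step 2: replace characters that have no diacritic decomposition,
    # using the explicit mapping.
    text = ''.join(NONDIACRITICS.get(char, char) for char in text)
    # Step 3: decompose accented characters into base character plus
    # combining mark, then drop everything that is not ASCII.
    # NFKD and errors='ignore' are assumptions made for this sketch.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')
```

Under these assumptions, normalize_text('Müller') yields 'Muller' and normalize_text('Straße') yields 'Strasse'. Note that any non-ASCII character without a decomposition or a mapping entry is silently dropped in this sketch.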
- openclean.function.value.normalize.text.default_preproc(value: Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]) str
Default pre-processing for string normalization. Ensures that the given argument is a string. Removes leading and trailing whitespaces, converts characters to lower case, and replaces all (consecutive) whitespaces with a single blank space character.
- Parameters
value (scalar or tuple) – Input value that is being prepared for normalization.
- Return type
string
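The pre-processing described for default_preproc could look roughly like the sketch below. The name preproc_sketch is hypothetical and the regular expression is an assumption about how consecutive whitespace might be collapsed; the library's own code may differ.

```python
import re


def preproc_sketch(value):
    """Hypothetical stand-in for default_preproc (not the library code)."""
    # Ensure that the argument is a string.
    text = value if isinstance(value, str) else str(value)
    # Remove leading and trailing whitespace and convert to lower case.
    text = text.strip().lower()
    # Replace each run of whitespace characters with a single blank space.
    return re.sub(r'\s+', ' ', text)
```

For example, preproc_sketch('  Hello \t\n World ') returns 'hello world' in this sketch.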