openclean.function.value.key.fingerprint module
Collection of token key functions. These functions generate keys for input values from intermediate token lists. The classes provide functionality similar to that found in OpenRefine:
https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
- class openclean.function.value.key.fingerprint.Fingerprint(tokenizer: Optional[openclean.function.token.base.Tokenizer] = None, normalizer: Optional[Callable] = None)
Bases:
openclean.function.value.base.PreparedFunction
Fingerprint key generator that is adapted from OpenRefine: http://github.com/OpenRefine/OpenRefine/blob/master/main/src/com/google/refine/clustering/binning/FingerprintKeyer.java
The main difference here is that the user can provide a custom tokenizer and a custom normalization function. The steps for creating the key are similar to those explained in: https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
Remove leading and trailing whitespace.
Convert string to lower case.
Normalize string by removing punctuation and control characters and replacing non-diacritic characters (if the default normalizer is used).
Tokenize string by splitting on whitespace characters. Then sort the tokens and remove duplicates (if the default tokenizer is used).
Concatenate remaining (sorted) tokens using a single space character as the delimiter.
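The steps above can be sketched in plain Python. This is a minimal stand-alone illustration, not the openclean implementation: the function name fingerprint_key is hypothetical, and the normalization is simplified to dropping ASCII punctuation and non-whitespace control characters (no handling of diacritics).

```python
import string
import unicodedata


def fingerprint_key(value):
    """Hypothetical sketch of the fingerprint steps (not openclean code)."""
    # Steps 1 and 2: trim surrounding whitespace and convert to lower case.
    text = str(value).strip().lower()
    # Step 3 (simplified assumption): drop ASCII punctuation and control
    # characters, but keep whitespace so the tokens stay separated.
    text = "".join(
        ch for ch in text
        if ch.isspace()
        or (ch not in string.punctuation and unicodedata.category(ch) != "Cc")
    )
    # Step 4: split on whitespace, remove duplicates, and sort the tokens.
    tokens = sorted(set(text.split()))
    # Step 5: join the remaining tokens with a single space.
    return " ".join(tokens)


# Values that differ only in case, punctuation, or token order share a key.
print(fingerprint_key("  Tom Cruise "))
print(fingerprint_key("Cruise, Tom"))
```

Because the tokens are sorted and de-duplicated before they are joined, both calls produce the same key, which is what makes the key useful for grouping variants of the same value.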
- eval(value: Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]) → str
Tokenize a given value and return a concatenated string of the resulting tokens.
- Parameters
value (scalar or tuple) – Input value that is tokenized and concatenated.
- Return type
string
- class openclean.function.value.key.fingerprint.NGramFingerprint(n: int, pleft: Optional[str] = None, pright: Optional[str] = None, normalizer: Optional[Callable] = None)
Bases:
openclean.function.value.key.fingerprint.Fingerprint
Fingerprint key generator that uses an n-gram tokenizer instead of the default tokenizer. This is a shortcut to instantiate the Fingerprint key generator.
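The idea of an n-gram based key can be illustrated with a small stand-alone sketch. It is not the openclean implementation: the function name ngram_fingerprint_key is hypothetical, the padding parameters pleft and pright are left out, and the choices to strip all non-alphanumeric characters (following the OpenRefine description) and to join the n-grams with a single space are assumptions.

```python
def ngram_fingerprint_key(value, n):
    """Hypothetical sketch of an n-gram fingerprint (not openclean code)."""
    # Assumed normalization: lower-case the value and keep only letters
    # and digits, so whitespace and punctuation do not affect the key.
    text = "".join(ch for ch in str(value).lower() if ch.isalnum())
    # Collect all character n-grams; the set removes duplicates.
    grams = {text[i:i + n] for i in range(len(text) - n + 1)}
    # Sort the n-grams and join them with a single space (assumed delimiter).
    return " ".join(sorted(grams))


print(ngram_fingerprint_key("Paris", 2))
print(ngram_fingerprint_key("banana", 2))
```

Small values of n make the key more tolerant of typos and spacing differences than the whitespace-based fingerprint, at the cost of more false matches. In this sketch, values shorter than n yield an empty key.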