openclean.function.value.key.fingerprint module

Collection of token key functions. These functions generate keys for input values from intermediate token lists. The classes provide functionality similar to that found in OpenRefine:

https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth

class openclean.function.value.key.fingerprint.Fingerprint(tokenizer: Optional[openclean.function.token.base.Tokenizer] = None, normalizer: Optional[Callable] = None)

Bases: openclean.function.value.base.PreparedFunction

Fingerprint key generator that is adapted from OpenRefine: http://github.com/OpenRefine/OpenRefine/blob/master/main/src/com/google/refine/clustering/binning/FingerprintKeyer.java

The main difference is that the user can provide their own tokenizer and normalization functions. The steps for creating the key are similar to those explained in: https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth

  1. Remove leading and trailing whitespace.

  2. Convert string to lower case.

  3. Normalize string by removing punctuation and control characters and replacing characters that carry diacritics with their ASCII equivalents (if the default normalizer is used).

  4. Tokenize string by splitting on whitespace characters. Then sort the tokens and remove duplicates (if the default tokenizer is used).

  5. Concatenate remaining (sorted) tokens using a single space character as the delimiter.
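The five steps above can be sketched in plain Python. This is a minimal illustration that uses only the standard library; it is not the openclean implementation, and the function name `fingerprint_sketch` as well as the exact normalization rules (Unicode NFKD decomposition, ASCII punctuation only) are assumptions made for the example.

```python
import string
import unicodedata


def fingerprint_sketch(value: str) -> str:
    """Hypothetical stand-in for the default fingerprint key steps."""
    # Steps 1 and 2: trim surrounding whitespace and convert to lower case.
    text = value.strip().lower()

    # Step 3 (assumed rules): decompose accented characters, drop the
    # combining marks, remove ASCII punctuation, and remove control
    # characters other than whitespace.
    def keep(c: str) -> bool:
        if unicodedata.combining(c) or c in string.punctuation:
            return False
        return c.isspace() or unicodedata.category(c) != "Cc"

    decomposed = unicodedata.normalize("NFKD", text)
    cleaned = "".join(c for c in decomposed if keep(c))

    # Step 4: split on whitespace, remove duplicates, and sort the tokens.
    tokens = sorted(set(cleaned.split()))

    # Step 5: join the remaining tokens with a single space.
    return " ".join(tokens)


key_a = fingerprint_sketch("  Café, New York new YORK!  ")
key_b = fingerprint_sketch("york NEW cafe")
```

Under these assumptions both inputs map to the same key, `cafe new york`, which is what makes the key usable for grouping spelling variants of the same value.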

eval(value: Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]) → str

Tokenize a given value and return a concatenated string of the resulting tokens.

Parameters

value (scalar or tuple) – Input value that is tokenized and concatenated.

Return type

string

class openclean.function.value.key.fingerprint.NGramFingerprint(n: int, pleft: Optional[str] = None, pright: Optional[str] = None, normalizer: Optional[Callable] = None)

Bases: openclean.function.value.key.fingerprint.Fingerprint

Fingerprint key generator that uses an n-gram tokenizer instead of the default tokenizer. This class is a shortcut for instantiating the Fingerprint key generator with an n-gram tokenizer.
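To illustrate the idea of an n-gram based key, the following sketch builds character n-grams, then sorts, de-duplicates, and joins them. It is a standalone approximation rather than the openclean code: the helper names are hypothetical, the normalization step is omitted, and the padding behavior (repeating `pleft` and `pright` n-1 times) is an assumption about how the parameters are meant to work.

```python
from typing import List, Optional


def char_ngrams(
    text: str,
    n: int,
    pleft: Optional[str] = None,
    pright: Optional[str] = None,
) -> List[str]:
    """Return all character n-grams of text (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Assumed padding semantics: repeat each padding symbol n-1 times.
    if pleft:
        text = pleft * (n - 1) + text
    if pright:
        text = text + pright * (n - 1)
    # A text shorter than n yields an empty list.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_fingerprint_sketch(
    value: str,
    n: int,
    pleft: Optional[str] = None,
    pright: Optional[str] = None,
) -> str:
    """Sorted, de-duplicated n-grams joined by a single space."""
    text = value.strip().lower()
    tokens = sorted(set(char_ngrams(text, n, pleft, pright)))
    return " ".join(tokens)


key = ngram_fingerprint_sketch("Paris", 2)
padded = char_ngrams("ab", 2, pleft="$", pright="#")
```

Because the n-grams are sorted and de-duplicated, values that share the same set of n-grams (for example, strings with repeated fragments) collapse to the same key.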