openclean.cluster.knn module

Implementation of Nearest Neighbor clustering (also known as kNN) that uses a string similarity function and a threshold (radius).

kNN clustering groups strings whose similarity satisfies the given radius constraint. This implementation is based on the hybrid blocking approach that is implemented in OpenRefine: https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth

The algorithm works by performing a first pass over the strings in order to group them into blocks of strings that share at least one n-gram. It then uses a given string similarity function to compute the similarity between strings within each of the created blocks.
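The two passes described above can be sketched in plain Python. This is only an illustration of the blocking idea and not the openclean implementation: the helper names (ngrams, blocked_neighbors, knn_sketch) are hypothetical, difflib.SequenceMatcher stands in for the similarity function, and the rule that each value forms a cluster with its neighbors is an assumption.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def ngrams(value, n=6):
    """Return the set of character n-grams of a string.

    Strings that are not longer than n yield themselves as the only token.
    """
    if len(value) <= n:
        return {value}
    return {value[i:i + n] for i in range(len(value) - n + 1)}


def blocked_neighbors(values, threshold=0.9, n=6):
    """Map each value to the values in a shared block that are similar enough."""
    # Pass 1: group the distinct strings into blocks keyed by n-gram.
    blocks = defaultdict(set)
    for value in set(values):
        for token in ngrams(value, n):
            blocks[token].add(value)
    # Pass 2: compare only pairs of strings that share at least one block.
    neighbors = defaultdict(set)
    compared = set()
    for block in blocks.values():
        for a, b in combinations(sorted(block), 2):
            if (a, b) in compared:
                continue  # Pair was already compared in another block.
            compared.add((a, b))
            # Stand-in similarity function (assumption, not openclean's).
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                neighbors[a].add(b)
                neighbors[b].add(a)
    return neighbors


def knn_sketch(values, threshold=0.9, n=6, minsize=2):
    """Build one cluster per value and its neighbors, dropping duplicates."""
    neighbors = blocked_neighbors(values, threshold=threshold, n=n)
    unique = {frozenset(nbrs | {value}) for value, nbrs in neighbors.items()}
    return [set(cluster) for cluster in unique if len(cluster) >= minsize]


streets = ["Broadway Avenue", "Broadway Avenu", "Brodway Avenue", "Main Street"]
result = knn_sketch(streets)
print(result)
```

Blocking keeps the number of similarity computations small: "Main Street" shares no 6-gram with the other values, so it is never compared against them.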

class openclean.cluster.knn.kNNClusterer(sim: openclean.function.similarity.base.SimilarityConstraint, tokenizer: Optional[openclean.function.token.base.Tokenizer] = None, minsize: Optional[int] = 2, remove_duplicates: Optional[bool] = True)

Bases: openclean.cluster.base.Clusterer

Nearest Neighbor clustering algorithm that is based on a hybrid clustering approach.

The algorithm works by performing a first pass over the strings in order to group them into blocks of strings that share at least one token (e.g., n-gram). It then uses a given string similarity function to compute the similarity between strings within each of the created blocks.

clusters(values: Union[Iterable[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]], collections.Counter]) → List[openclean.cluster.base.Cluster]

Compute clusters for a given list of values. Each cluster itself is a list of values, i.e., a subset of values from the input list.

Parameters: values (iterable of values or collections.Counter) – Iterable of data values or a value counter that maps values to their frequencies.
Return type: list of openclean.cluster.base.Cluster
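As a rough illustration of the two accepted input forms, a plain iterable can be turned into the same value-to-frequency mapping that a collections.Counter already provides. The helper name as_counter is hypothetical and is not part of openclean; how the library normalizes its input internally is not stated here.

```python
from collections import Counter


def as_counter(values):
    """Return values unchanged if it is a Counter, else count the items."""
    if isinstance(values, Counter):
        return values
    return Counter(values)


# Both forms describe the same data: two occurrences of "NY", one of "N.Y.".
from_list = as_counter(["NY", "N.Y.", "NY"])
from_counter = as_counter(Counter({"NY": 2, "N.Y.": 1}))
print(from_list == from_counter)
```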

openclean.cluster.knn.knn_clusters(values: Union[Iterable[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]], collections.Counter], sim: openclean.function.similarity.base.SimilarityConstraint, tokenizer: Optional[openclean.function.token.base.Tokenizer] = None, minsize: Optional[int] = 2, remove_duplicates: Optional[bool] = True) → List[openclean.cluster.base.Cluster]

Run kNN clustering for a given list of values.

Parameters

values (iterable of values or collections.Counter) – Iterable of data values or a value counter that maps values to their frequencies.
sim (openclean.function.similarity.base.SimilarityConstraint) – String similarity constraint for grouping strings in the generated blocks.
tokenizer (openclean.function.token.base.Tokenizer, default=None) – Generator for tokens that are used to group string values in the first step of the algorithm. By default, n-grams of length 6 are used as blocking tokens.
minsize (int, default=2) – Minimum number of distinct values that each cluster in the returned result has to have.
remove_duplicates (bool, default=True) – Remove identical clusters from the result if True.

Return type

list of openclean.cluster.base.Cluster

openclean.cluster.knn.knn_collision_clusters(values: Union[Iterable[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]], collections.Counter], sim: openclean.function.similarity.base.SimilarityConstraint, keys: Optional[Union[Callable, openclean.function.value.base.ValueFunction]] = None, tokenizer: Optional[openclean.function.token.base.Tokenizer] = None, minsize: Optional[int] = 2, remove_duplicates: Optional[bool] = True, threads: Optional[int] = None) → List[openclean.cluster.base.Cluster]

Run kNN clustering on a set of values that have been grouped using collision clustering.

This algorithm first performs collision key clustering for the given list of values using the key generator. It then uses kNN clustering on the keys for the generated clusters.
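The two stages can be sketched as follows. This is a simplified illustration under several assumptions, not the openclean implementation: the fingerprint function is a hand-written stand-in for the default key generator, difflib.SequenceMatcher stands in for the similarity constraint, all key pairs are compared instead of blocking by n-gram, and merging the value groups of similar keys is an assumed way to combine the results.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def fingerprint(value):
    """Hypothetical key function: lowercase, drop punctuation, sort unique tokens."""
    tokens = re.sub(r"[^\w\s]", "", value.lower()).split()
    return " ".join(sorted(set(tokens)))


def collision_then_similarity(values, threshold=0.9, minsize=2):
    # Stage 1: collision key clustering. Values with the same key share a group.
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    # Stage 2: compare the keys (not the raw values) and merge the groups
    # of keys that are similar enough.
    clusters = {key: set(members) for key, members in groups.items()}
    for a, b in combinations(sorted(groups), 2):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            clusters[a] |= groups[b]
            clusters[b] |= groups[a]
    # Drop identical clusters and clusters with too few distinct values.
    unique = {frozenset(cluster) for cluster in clusters.values()}
    return [set(cluster) for cluster in unique if len(cluster) >= minsize]


streets = [
    "Avenue Broadway",
    "Broadway Avenue",
    "broadway, avenue",
    "Brodway Avenue",
    "Main Street",
]
merged = collision_then_similarity(streets)
print(merged)
```

Running the similarity step on keys rather than on raw values means that differences in case, punctuation, and token order are already resolved before any pairwise comparison takes place.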

Parameters

values (iterable of values or collections.Counter) – Iterable of data values or a value counter that maps values to their frequencies.
sim (openclean.function.similarity.base.SimilarityConstraint) – String similarity constraint for grouping strings in the generated blocks.
keys (callable or ValueFunction, default=None) – Function that is used to generate keys for values. By default, the token fingerprint generator is used.
tokenizer (openclean.function.token.base.Tokenizer, default=None) – Generator for tokens that are used to group string values in the first step of the algorithm. By default, n-grams of length 6 are used as blocking tokens.
minsize (int, default=2) – Minimum number of distinct values that each cluster in the returned result has to have.
remove_duplicates (bool, default=True) – Remove identical clusters from the result if True.
threads (int, default=None) – Number of parallel threads to use for key generation. If None, the value from the environment variable ‘OPENCLEAN_THREADS’ is used as the default.

Return type

list of openclean.cluster.base.Cluster