openclean.cluster.key module

Key-based clustering methods. Computes a different representation (key) for each value and clusters values based on these keys.

class openclean.cluster.key.KeyCollision(func: Union[Callable, openclean.function.value.base.ValueFunction], minsize: Optional[int] = 2, threads: Optional[int] = None)

Bases: openclean.cluster.base.Clusterer

Key collision methods create an alternative representation for each value (i.e., a key), and then group values based on their keys.

Generates clusters that satisfy a given minimum size threshold. Keys can be computed in parallel using multiple threads.
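The grouping idea described above can be illustrated with a minimal pure-Python sketch. It is not openclean's implementation: the names key_collision_sketch and normalize are hypothetical, and the key function is a deliberately simple stand-in.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List


def key_collision_sketch(
    values: Iterable[Hashable],
    func: Callable[[Hashable], str],
    minsize: int = 2,
) -> Dict[str, List[Hashable]]:
    """Group distinct values by their key; keep groups with >= minsize values."""
    groups: Dict[str, List[Hashable]] = defaultdict(list)
    # dict.fromkeys removes duplicate values while preserving input order.
    for value in dict.fromkeys(values):
        groups[func(value)].append(value)
    return {key: vals for key, vals in groups.items() if len(vals) >= minsize}


def normalize(value) -> str:
    """Hypothetical key function: ignore case and surrounding whitespace."""
    return str(value).strip().lower()


clusters = key_collision_sketch(
    ["Brooklyn", "brooklyn ", "BROOKLYN", "Queens"], normalize
)
print(clusters)  # {'brooklyn': ['Brooklyn', 'brooklyn ', 'BROOKLYN']}
```

"Queens" is the only value with its key, so its group falls below the minimum size of 2 and is dropped.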

clusters(values: Union[Iterable[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]], collections.Counter]) → List[openclean.cluster.key.KeyCollisionCluster]

Compute clusters for a given list of values. Each cluster itself is a list of values, i.e., a subset of values from the input list.

Parameters

values (iterable of values or collections.Counter) – Iterable of data values or a value counter that maps values to their frequencies.

Return type

list of openclean.cluster.key.KeyCollisionCluster
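Accepting either a plain iterable or a collections.Counter can be handled by normalizing the input to a counter first. The sketch below shows one way to do that while keeping value frequencies in each cluster. The function name cluster_with_counts is hypothetical, and the plain dictionaries it returns are an assumption for illustration, not the KeyCollisionCluster type.

```python
from collections import Counter, defaultdict


def cluster_with_counts(values, func, minsize=2):
    """Group values by key and keep each value's frequency in its cluster."""
    # Counter(...) counts the items of a plain iterable and copies the
    # counts of an existing Counter, so both input forms are supported.
    counts = Counter(values)
    groups = defaultdict(Counter)
    for value, count in counts.items():
        groups[func(value)][value] += count
    # The size threshold applies to the number of distinct values,
    # not to the sum of their frequencies.
    return {key: c for key, c in groups.items() if len(c) >= minsize}


from_list = cluster_with_counts(["NY", "ny", "ny", "LA"], str.lower)
from_counter = cluster_with_counts(Counter({"NY": 1, "ny": 2, "LA": 1}), str.lower)
print(from_list == from_counter)  # True
```

Both calls yield a single cluster under the key "ny" that records "NY" once and "ny" twice.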

class openclean.cluster.key.KeyCollisionCluster(key: str)

Bases: openclean.cluster.base.Cluster

Key collision clusters represent the results of the key collision clusterer. The key collision cluster extends the superclass with a reference to the collision key shared by all values in the cluster.

class openclean.cluster.key.KeyValueGenerator(func: openclean.function.value.base.ValueFunction)

Bases: object

Key-value pair generator for parallel processing.
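A key-value pair generator of this kind can be pictured as a small callable object that wraps the key function, so that it can be handed to a worker pool. The sketch below uses the standard library's ThreadPoolExecutor. The names KeyValuePairs and generate_pairs are hypothetical, and whether openclean uses threads or processes internally is not stated here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional, Tuple


class KeyValuePairs:
    """Hypothetical stand-in: callable that maps a value to (key, value)."""

    def __init__(self, func: Callable):
        self.func = func

    def __call__(self, value) -> Tuple[str, object]:
        return self.func(value), value


def generate_pairs(
    values: Iterable, func: Callable, threads: Optional[int] = None
) -> List[Tuple[str, object]]:
    make_pair = KeyValuePairs(func)
    # Fall back to a sequential loop when no parallelism is requested.
    if not threads or threads <= 1:
        return [make_pair(v) for v in values]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Executor.map returns results in input order.
        return list(pool.map(make_pair, values))


pairs = generate_pairs(["A b", "a B", "c"], str.lower, threads=2)
print(pairs)  # [('a b', 'A b'), ('a b', 'a B'), ('c', 'c')]
```

Returning the original value alongside its key lets the caller group values after the parallel step without recomputing any keys.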

openclean.cluster.key.key_collision(values: Union[Iterable[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]], collections.Counter], func: Optional[Union[Callable, openclean.function.value.base.ValueFunction]] = None, minsize: Optional[int] = 2, threads: Optional[int] = None) → List[openclean.cluster.key.KeyCollisionCluster]

Run key collision clustering for a given list of values.

Parameters
  • values (iterable of values or collections.Counter) – Iterable of data values or a value counter that maps values to their frequencies.

  • func (callable or ValueFunction, default=None) – Function that is used to generate keys for values. By default the token fingerprint generator is used.

  • minsize (int, default=2) – Minimum number of distinct values that each cluster in the returned result has to have.

  • threads (int, default=None) – Number of parallel threads to use for key generation. If None, the value of the environment variable ‘OPENCLEAN_THREADS’ is used as the default.

Return type

list of openclean.cluster.key.KeyCollisionCluster
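To make the two defaults above more concrete, the sketch below shows a simplified token fingerprint and one way to resolve the thread count from the environment. Both functions are hypothetical illustrations: openclean's actual fingerprint may normalize values differently, and the fallback of 1 when ‘OPENCLEAN_THREADS’ is unset is an assumption.

```python
import os
import string
from typing import Optional


def token_fingerprint(value) -> str:
    """Simplified fingerprint: lowercase, strip punctuation, sort unique tokens."""
    text = str(value).strip().lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def resolve_threads(threads: Optional[int] = None) -> int:
    """Use the explicit argument if given, else the environment variable."""
    if threads is not None:
        return threads
    # Assumption: fall back to a single thread when the variable is unset.
    return int(os.environ.get("OPENCLEAN_THREADS", "1"))


print(token_fingerprint("Smith, John"))   # john smith
print(token_fingerprint("john  SMITH"))   # john smith
```

Because both spellings produce the same key, a key collision clusterer using this function would place them in the same cluster.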