openclean.cluster.key module
Key-based clustering methods. These methods compute an alternative representation (a key) for each value and cluster values based on these keys.
- class openclean.cluster.key.KeyCollision(func: Union[Callable, openclean.function.value.base.ValueFunction], minsize: Optional[int] = 2, threads: Optional[int] = None)
Bases: openclean.cluster.base.Clusterer
Key collision methods create an alternative representation for each value (i.e., a key), and then group values based on their keys.
Only clusters that satisfy a given minimum size threshold are returned. Keys can be computed in parallel using multiple threads.
- clusters(values: Union[Iterable[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]], collections.Counter]) → List[openclean.cluster.key.KeyCollisionCluster]
Compute clusters for a given list of values. Each cluster itself is a list of values, i.e., a subset of values from the input list.
- Parameters
values (iterable of values or collections.Counter) – Iterable of data values or a value counter that maps values to their frequencies.
- Return type
list of openclean.cluster.key.KeyCollisionCluster
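The grouping step described above can be illustrated without the library. The following is a minimal sketch in plain Python, not openclean's implementation: the helper name `key_collision_sketch` and the lowercase-and-strip key function are assumptions made for illustration only.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Hashable, Iterable, Union


def key_collision_sketch(
    values: Union[Iterable[Hashable], Counter],
    func: Callable[[Hashable], Hashable],
    minsize: int = 2,
) -> Dict[Hashable, Counter]:
    """Group values by key; keep groups with at least `minsize` distinct values.

    Hypothetical helper for illustration, not part of openclean.
    """
    # Accept either raw values or a precomputed value-to-frequency counter.
    counts = values if isinstance(values, Counter) else Counter(values)
    groups: Dict[Hashable, Counter] = defaultdict(Counter)
    for value, freq in counts.items():
        # Values whose keys collide end up in the same group.
        groups[func(value)][value] += freq
    # Apply the minimum size threshold on the number of distinct values.
    return {key: members for key, members in groups.items() if len(members) >= minsize}


names = ["Brooklyn", "brooklyn", "BROOKLYN ", "Queens", "Bronx", "bronx"]
clusters = key_collision_sketch(names, func=lambda v: v.strip().lower())
for key, members in clusters.items():
    print(key, dict(members))
```

In this sketch "Queens" forms a group of size one and is therefore dropped by the default threshold of 2.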
- class openclean.cluster.key.KeyCollisionCluster(key: str)
Bases: openclean.cluster.base.Cluster
Key collision clusters represent the results of the key collision clusterer. A key collision cluster extends the superclass with a reference to the collision key that is shared by all values in the cluster.
- class openclean.cluster.key.KeyValueGenerator(func: openclean.function.value.base.ValueFunction)
Bases: object
Key-value pair generator for parallel processing.
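One way to picture a key-value pair generator for parallel processing is a small callable object that wraps the key function and is mapped over the values by a worker pool. The sketch below is an assumption about the general pattern, not the library's code: `KeyValuePairSketch` and `generate_pairs` are hypothetical names, and the standard library's `ThreadPoolExecutor` stands in for whatever parallelization mechanism openclean actually uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterable, List, Optional, Tuple


class KeyValuePairSketch:
    """Callable that maps a value to a (key, value) pair (hypothetical)."""

    def __init__(self, func: Callable[[Any], Any]):
        self.func = func

    def __call__(self, value: Any) -> Tuple[Any, Any]:
        return self.func(value), value


def generate_pairs(
    values: Iterable[Any],
    func: Callable[[Any], Any],
    threads: Optional[int] = None,
) -> List[Tuple[Any, Any]]:
    """Compute (key, value) pairs, using a thread pool if threads > 1."""
    make_pair = KeyValuePairSketch(func)
    if threads is None or threads <= 1:
        # Sequential fallback when no parallelism is requested.
        return [make_pair(v) for v in values]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Executor.map returns results in the order of the input values.
        return list(pool.map(make_pair, values))


pairs = generate_pairs(["A b", "b a", "C"], func=str.lower, threads=2)
print(pairs)
```

Wrapping the function in an object rather than a lambda keeps the callable reusable across workers; the resulting pairs can then be grouped by key in a single sequential pass.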
- openclean.cluster.key.key_collision(values: Union[Iterable[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]], collections.Counter], func: Optional[Union[Callable, openclean.function.value.base.ValueFunction]] = None, minsize: Optional[int] = 2, threads: Optional[int] = None) → List[openclean.cluster.key.KeyCollisionCluster]
Run key collision clustering for a given list of values.
- Parameters
values (iterable of values or collections.Counter) – Iterable of data values or a value counter that maps values to their frequencies.
func (callable or ValueFunction, default=None) – Function that is used to generate keys for values. By default the token fingerprint generator is used.
minsize (int, default=2) – Minimum number of distinct values that each cluster in the returned result must contain.
threads (int, default=None) – Number of parallel threads to use for key generation. If None, the value of the environment variable ‘OPENCLEAN_THREADS’ is used as the default.
- Return type
list of openclean.cluster.key.KeyCollisionCluster
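To give a sense of what a token fingerprint key might look like, here is a rough standalone sketch. It follows the commonly used fingerprint recipe (trim, lowercase, strip punctuation, sort the unique tokens); the exact normalization steps of openclean's default generator may differ, and `token_fingerprint_sketch` is a hypothetical name that is not part of the library.

```python
import string
from collections import Counter, defaultdict


def token_fingerprint_sketch(value) -> str:
    """Approximate a token fingerprint key (assumed normalization steps)."""
    text = str(value).strip().lower()
    # Remove ASCII punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, de-duplicate tokens, and sort them.
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


# A value counter that maps values to their frequencies.
counts = Counter({"Smith, John": 3, "john smith": 1, "John  Smith.": 2, "Jane Doe": 5})

groups = defaultdict(list)
for value in counts:
    groups[token_fingerprint_sketch(value)].append(value)

# Keep only keys shared by at least two distinct values (minsize=2).
collisions = {key: vals for key, vals in groups.items() if len(vals) >= 2}
print(collisions)
```

Because tokens are sorted, "Smith, John" and "john smith" map to the same key, whereas "Jane Doe" has no colliding partner and is left out.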