openclean.cluster.base module

Interfaces for clustering. Openclean adopts the same notion of clustering as OpenRefine: […] clustering refers to the operation of ‘finding groups of different values that might be alternative representations of the same thing’*.

* https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth

class openclean.cluster.base.Cluster(**kwds)

Bases: collections.Counter

Cluster of values. Maintains the frequency count for each value so that a ‘new value’ can be suggested as the common value for all values in the cluster.

add(value: Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]], count: Optional[int] = 1) → openclean.cluster.base.Cluster

Add a value to the cluster. An optional frequency count can be provided for the added value. Returns a reference to the cluster itself.

Parameters

value (scalar or tuple) – Value that is added to the cluster.
count (int, default=1) – Frequency count for the added value.
Return type: openclean.cluster.base.Cluster

suggestion() → Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]

Suggest a new value as the common value for all values in the cluster. The suggestion is the most frequent value in the cluster. If multiple values have the same frequency, the returned value depends on how ties are broken in the superclass collections.Counter.

Return type: scalar or tuple
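The behavior described for add() and suggestion() can be sketched with a small stand-in class built directly on collections.Counter. This is an illustration only and not the openclean implementation; the class name SketchCluster and the sample values are hypothetical.

```python
from collections import Counter


class SketchCluster(Counter):
    """Hypothetical stand-in for a cluster; not the openclean class."""

    def add(self, value, count=1):
        # Increase the frequency of the value and return self for chaining.
        self[value] += count
        return self

    def suggestion(self):
        # most_common(1) returns a one-element list holding the
        # (value, count) pair with the highest count.
        return self.most_common(1)[0][0]


cluster = SketchCluster().add("New York", count=3).add("new york").add("NewYork")
suggested = cluster.suggestion()
```

Because add() returns the cluster itself, calls can be chained as shown. Here "New York" has the highest count, so it is the suggested value.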

to_mapping(target: Optional[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]] = None) → Dict

Create a mapping for the values in the cluster to a given target value. This is primarily intended for standardization where all values in this cluster are mapped to a single target value.

If the target value is not specified, the suggested value for this cluster is used as the default.

The resulting mapping does not include an entry for the target itself. That is, if the target is a value in the cluster, that entry is excluded from the generated mapping.

Parameters: target (scalar or tuple, default=None) – Target value to which all values in this cluster are mapped.
Return type: dict
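The mapping rules above (default target, target excluded from the result) can be sketched as a plain function over a collections.Counter. The function name to_mapping_sketch and the sample data are assumptions made for illustration; this is not the library code.

```python
from collections import Counter


def to_mapping_sketch(cluster, target=None):
    """Hypothetical helper mirroring the behavior described above."""
    if target is None:
        # Default to the most frequent value in the cluster.
        target = cluster.most_common(1)[0][0]
    # Map every value except the target itself to the target.
    return {value: target for value in cluster if value != target}


cluster = Counter({"Brooklyn": 5, "brooklyn": 2, "BROOKLYN": 1})
default_mapping = to_mapping_sketch(cluster)
custom_mapping = to_mapping_sketch(cluster, target="BK")
```

With the default target, "Brooklyn" is left out of the mapping because it is the target. With the custom target "BK", which is not in the cluster, all three values are mapped.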

class openclean.cluster.base.Clusterer

Bases: openclean.operator.stream.processor.StreamProcessor

The value clusterer mixin class defines a single abstract method clusters() to cluster a given list of values.

abstract clusters(values: Union[Iterable[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]], collections.Counter]) → List[openclean.cluster.base.Cluster]

Compute clusters for a given list of values. Each cluster is a collection of values with their frequency counts, i.e., a subset of the values from the input list. The clusters method should accept either a list of values or a counter of distinct values.

Parameters: values (iterable of values or collections.Counter) – Iterable of data values or a value counter that maps values to their frequencies.
Return type: list of openclean.cluster.base.Cluster
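As a rough sketch of the contract described above, the following function accepts either an iterable of values or a collections.Counter and returns a list of counters. The grouping rule (lowercasing and removing whitespace) and the decision to drop groups with a single distinct value are assumptions chosen for illustration; they are not a clustering algorithm prescribed by openclean.

```python
from collections import Counter, defaultdict


def key_collision_clusters(values):
    """Hypothetical clusterer: group values that share a normalized key."""
    # Accept either a Counter or any iterable of values.
    counts = values if isinstance(values, Counter) else Counter(values)
    groups = defaultdict(Counter)
    for value, count in counts.items():
        # Normalize by lowercasing and removing all whitespace.
        key = "".join(str(value).lower().split())
        groups[key][value] += count
    # Keep only groups with at least two distinct values (an assumption).
    return [group for group in groups.values() if len(group) > 1]


from_list = key_collision_clusters(["NYC", "nyc", "NYC", "Boston"])
from_counter = key_collision_clusters(Counter({"NYC": 2, "nyc": 1, "Boston": 1}))
```

Both calls produce the same single cluster, since a list of values and a counter of the same values carry the same frequency information.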

open(schema: List[Union[str, histore.document.schema.Column]]) → openclean.operator.stream.consumer.StreamConsumer

Factory pattern for the stream consumer.

Returns an instance of the stream clusterer that collects the distinct values in the stream and then calls the clusters method of this clusterer.

Parameters: schema (list of string) – List of column names in the data stream schema.
Return type: openclean.cluster.base.StreamClusterer

class openclean.cluster.base.ONE

Bases: object

Helper to simulate a counter where each value has a frequency of 1.
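One way such a helper could work is to return 1 for every lookup, so that code written against a counter also works when no frequencies are known. The class AlwaysOne and the function total_frequency below are hypothetical and only illustrate the idea; the actual implementation of ONE may differ.

```python
class AlwaysOne:
    """Hypothetical counter-like helper: every lookup yields frequency 1."""

    def __getitem__(self, key):
        return 1


def total_frequency(values, counts):
    # Sum frequencies using any object that supports counts[value].
    return sum(counts[v] for v in values)


values = ["a", "b", "c"]
total = total_frequency(values, AlwaysOne())
```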

class openclean.cluster.base.StreamClusterer(clusterer: openclean.cluster.base.Clusterer)

Bases: openclean.operator.stream.consumer.StreamConsumer

Cluster values in a stream. This implementation collects the distinct values in the stream together with their frequency counts. It then applies the given clustering algorithm to the collected values.

close() → List[openclean.cluster.base.Cluster]

Closing the consumer returns the result of applying the associated clusterer on the collected set of distinct values.

Return type: list of openclean.cluster.base.Cluster

consume(rowid: int, row: List)

Add the values in a given row to the internal counter.

If the row has only one value, this value is used as the key for the counter. For rows with multiple values, the values in the row are concatenated (separated by a blank space) into a single string value.

Parameters

rowid (int) – Unique row identifier
row (list) – List of values in the row.
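The consume/close cycle described above can be sketched as follows. The class SketchStreamCounter and its cluster_func argument are hypothetical stand-ins, and converting values with str() before joining is an assumption; this is not the openclean StreamClusterer.

```python
from collections import Counter


class SketchStreamCounter:
    """Hypothetical consumer that counts row values; not the openclean class."""

    def __init__(self, cluster_func):
        self.cluster_func = cluster_func
        self.counter = Counter()

    def consume(self, rowid, row):
        # Single-value rows use the value itself as the key; longer rows
        # are joined with a blank space into one string.
        if len(row) == 1:
            key = row[0]
        else:
            key = " ".join(str(v) for v in row)
        self.counter[key] += 1

    def close(self):
        # Apply the clustering function to the collected value counts.
        return self.cluster_func(self.counter)


# A trivial clustering function that puts all values into one cluster.
consumer = SketchStreamCounter(cluster_func=lambda counts: [counts])
for rowid, row in enumerate([["Main", "St"], ["Main", "St"], ["Broadway"]]):
    consumer.consume(rowid, row)
result = consumer.close()
```

The two rows ["Main", "St"] collapse into the single key "Main St" with a count of 2, while the single-value row keeps its value "Broadway" as the key.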