openclean.cluster.base module

Interfaces for clustering. Openclean adopts the same notion of clustering as OpenRefine: […] clustering refers to the operation of ‘finding groups of different values that might be alternative representations of the same thing’*.

class openclean.cluster.base.Cluster(**kwds)

Bases: collections.Counter

Cluster of values. Maintains the frequency count for each value so that it can suggest a ‘new value’ as the common value for all values in the cluster.

add(value: Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]], count: Optional[int] = 1) openclean.cluster.base.Cluster

Add a value to the cluster. An optional frequency count can be provided for the added value. Returns a reference to the cluster itself.

Parameters

value (scalar or tuple) – Value that is added to the cluster.

count (int, default=1) – Frequency count for the added value.

Return type

openclean.cluster.base.Cluster
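
The behavior described for add() can be illustrated with a minimal stand-in built on collections.Counter. This is a sketch under stated assumptions, not the openclean source: the class name SketchCluster is hypothetical, and the only properties taken from the documentation above are that the cluster is a Counter, that add() accepts an optional count, and that it returns the cluster itself.

```python
from collections import Counter


class SketchCluster(Counter):
    """Hypothetical stand-in for the documented Cluster class."""

    def add(self, value, count=1):
        # Increase the frequency of the value by the given count.
        self[value] += count
        # Return the cluster itself so that calls can be chained.
        return self


# Chained calls work because add() returns the cluster.
cluster = SketchCluster().add("New York").add("new york", count=3)
```

Returning the cluster from add() is what makes the chained construction in the last line possible.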

suggestion() Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]

Suggest a new value as the common value for all values in the cluster. The suggestion is the most frequent value in the cluster. If multiple values have the same frequency, the returned value depends on how ties are broken in the superclass collections.Counter.

Return type

scalar or tuple
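
Selecting the most frequent value from a counter can be sketched with Counter.most_common(). This example assumes that the suggestion corresponds to the first element returned by most_common(1); that correspondence is an assumption for illustration and is not confirmed by the text above. The sample values are made up.

```python
from collections import Counter

# Frequency counts for three spellings of the same value.
cluster = Counter({"Brooklyn": 5, "brooklyn": 2, "BROOKLYN": 1})

# most_common(1) returns a list holding the (value, count) pair
# with the highest count.
suggested, frequency = cluster.most_common(1)[0]
```

In current Python versions, most_common() orders values with equal counts by the order in which they were first encountered, so the result for a tie depends on insertion order.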

to_mapping(target: Optional[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]] = None) Dict

Create a mapping for the values in the cluster to a given target value. This is primarily intended for standardization where all values in this cluster are mapped to a single target value.

If the target value is not specified the suggested value for this cluster is used as the default.

The resulting mapping will not include an entry for the target itself. That is, if the target is a value in the cluster, that entry is excluded from the generated mapping.

Parameters

target (scalar or tuple, default=None) – Target value to which all values in this cluster are mapped.

Return type

dict
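
The three rules above (explicit target, default to the suggestion, no entry for the target itself) can be sketched as a standalone function over a plain Counter. The function name to_mapping_sketch is hypothetical, and using most_common(1) as the default suggestion is an assumption of this sketch rather than the library's implementation.

```python
from collections import Counter


def to_mapping_sketch(cluster, target=None):
    """Map every value in the cluster to a single target value."""
    if target is None:
        # Assumed default: the most frequent value in the cluster.
        target = cluster.most_common(1)[0][0]
    # Skip the target itself so the mapping has no identity entry.
    return {value: target for value in cluster if value != target}


cluster = Counter({"NYC": 4, "nyc": 1, "N.Y.C.": 2})
default_mapping = to_mapping_sketch(cluster)
explicit_mapping = to_mapping_sketch(cluster, target="New York City")
```

With the default target, the mapping covers only the two less frequent spellings. With an explicit target that is not in the cluster, all three values are mapped.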

class openclean.cluster.base.Clusterer

Bases: openclean.operator.stream.processor.StreamProcessor

The value clusterer mixin class defines a single abstract method clusters() to cluster a given list of values.

abstract clusters(values: Union[Iterable[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]], collections.Counter]) List[openclean.cluster.base.Cluster]

Compute clusters for a given list of values. Each cluster contains a subset of the values from the input list. Implementations of clusters() should accept either a list of values or a counter of distinct values.

Parameters

values (iterable of values or collections.Counter) – Iterable of data values or a value counter that maps values to their frequencies.

Return type

list of openclean.cluster.base.Cluster
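
A minimal sketch of what an implementation of this contract might look like, written as a plain function so that it runs without openclean installed. The function name lowercase_clusters and the grouping rule (values that are equal after trimming whitespace and lowercasing) are hypothetical choices for illustration. Dropping groups with only one distinct value is also an assumption of this sketch. Plain Counter objects stand in for Cluster.

```python
from collections import Counter, defaultdict


def lowercase_clusters(values):
    """Group values that are equal after trimming and lowercasing."""
    # Accept either an iterable of values or a counter of distinct values.
    counts = values if isinstance(values, Counter) else Counter(values)
    groups = defaultdict(Counter)
    for value, count in counts.items():
        key = str(value).strip().lower()
        groups[key][value] += count
    # Keep only groups with at least two different representations.
    return [group for group in groups.values() if len(group) > 1]


from_list = lowercase_clusters(["A", "a", "b", "A "])
from_counter = lowercase_clusters(Counter({"X": 3, "x": 2, "y": 7}))
```

Normalizing the input to a counter first means the grouping loop runs once per distinct value and the frequencies carry over into each cluster.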

open(schema: List[Union[str, histore.document.schema.Column]]) openclean.operator.stream.consumer.StreamConsumer

Factory method for the stream consumer.

Returns an instance of the stream clusterer that will collect the distinct values in the stream and then call the cluster method of this clusterer.

Parameters

schema (list of string) – List of column names in the data stream schema.

Return type

openclean.cluster.base.StreamClusterer

class openclean.cluster.base.ONE

Bases: object

Helper to simulate a counter where each value has a frequency of 1.

class openclean.cluster.base.StreamClusterer(clusterer: openclean.cluster.base.Clusterer)

Bases: openclean.operator.stream.consumer.StreamConsumer

Cluster values in a stream. This implementation will create a set of distinct values in the stream together with their frequency counts. It will then apply a given cluster algorithm on the created value set.

close() List[openclean.cluster.base.Cluster]

Closing the consumer returns the result of applying the associated clusterer on the collected set of distinct values.

Return type

list of openclean.cluster.base.Cluster

consume(rowid: int, row: List)

Add the values in a given row to the internal counter.

If the row has only one value, that value is used as the key for the counter. For rows with multiple values, the values in the row are concatenated (separated by a blank space) into a single string value.

Parameters
  • rowid (int) – Unique row identifier.

  • row (list) – List of values in the row.
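
The consume() and close() behavior described above can be sketched as a small standalone class. This is an illustration under stated assumptions, not the openclean implementation: the class name SketchStreamClusterer is hypothetical, the clusterer is represented by a plain callable instead of a Clusterer object, and values in multi-value rows are converted with str() before joining.

```python
from collections import Counter


class SketchStreamClusterer:
    """Hypothetical stand-in that counts row values, then clusters them."""

    def __init__(self, cluster_func):
        # Callable that takes a Counter and returns a list of clusters.
        self.cluster_func = cluster_func
        self.counter = Counter()

    def consume(self, rowid, row):
        if len(row) == 1:
            # Single-value rows use the value itself as the key.
            key = row[0]
        else:
            # Multi-value rows are joined with a blank space.
            key = " ".join(str(value) for value in row)
        self.counter[key] += 1

    def close(self):
        # Apply the clusterer to the collected distinct values.
        return self.cluster_func(self.counter)


consumer = SketchStreamClusterer(lambda counts: [counts])
consumer.consume(0, ["a"])
consumer.consume(1, ["a"])
consumer.consume(2, ["x", 1])
result = consumer.close()
```

The lambda passed here simply wraps the counter in a list so that the collected keys and frequencies can be inspected after close().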