openclean.cluster.base module
Interfaces for clustering. Openclean adopts the same notion of clustering as OpenRefine: […] clustering refers to the operation of ‘finding groups of different values that might be alternative representations of the same thing’*.
- class openclean.cluster.base.Cluster(**kwds)
Bases:
collections.Counter
Cluster of values. Maintains the frequency count for each value in order to be able to suggest a ‘new value’ as the common value for all values in the cluster.
- add(value: Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]], count: Optional[int] = 1) openclean.cluster.base.Cluster
Add a value to the cluster, optionally with a frequency count for the added value. Returns a reference to the cluster itself.
- Parameters
value (scalar or tuple) – Value that is added to the cluster.
count (int, default=1) – Frequency count for the added value.
- Return type
openclean.cluster.base.Cluster
- suggestion() Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]
Suggest a new value as the common value for all values in the cluster. The suggestion is the most frequent value in the cluster. If multiple values have the same frequency, the returned value depends on how ties are broken in the super class collections.Counter.
- Return type
scalar or tuple
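The behavior described for add() and suggestion() can be illustrated with a minimal sketch built directly on collections.Counter. The class name SketchCluster is hypothetical and the bodies are assumptions based only on the descriptions above; this is not openclean's implementation.

```python
from collections import Counter


class SketchCluster(Counter):
    """Hypothetical stand-in for a cluster; not openclean's implementation."""

    def add(self, value, count=1):
        # Increase the frequency of the value and return self so that
        # calls can be chained.
        self[value] += count
        return self

    def suggestion(self):
        # most_common(1) returns a list holding the single most frequent
        # (value, count) pair.
        return self.most_common(1)[0][0]


cluster = SketchCluster().add("New York", count=5).add("new york").add("NY", count=2)
suggested = cluster.suggestion()
print(suggested)
```

Because add() returns the cluster itself, several values can be added in one chained expression, as in the last lines of the sketch.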
- to_mapping(target: Optional[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]] = None) Dict
Create a mapping for the values in the cluster to a given target value. This is primarily intended for standardization where all values in this cluster are mapped to a single target value.
If the target value is not specified the suggested value for this cluster is used as the default.
The resulting mapping will not include an entry for the target itself. That is, if the target is a value in the cluster that entry is excluded from the generated mapping.
- Parameters
target (scalar or tuple, default=None) – Target value to which all values in this cluster are mapped.
- Return type
dict
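The mapping rules described for to_mapping() can be sketched as a standalone function over a collections.Counter. The function name to_mapping_sketch is hypothetical, and the equality test used to exclude the target is an assumption; the sketch only mirrors the three rules stated above.

```python
from collections import Counter


def to_mapping_sketch(cluster, target=None):
    """Hypothetical helper mirroring the described to_mapping() behavior."""
    if target is None:
        # Default to the most frequent value in the cluster.
        target = cluster.most_common(1)[0][0]
    # Map every value to the target, leaving out the target itself.
    return {value: target for value in cluster if value != target}


cluster = Counter({"Brooklyn": 4, "brooklyn": 1, "BROOKLYN": 2})
default_mapping = to_mapping_sketch(cluster)
custom_mapping = to_mapping_sketch(cluster, target="BK")
print(default_mapping)
print(custom_mapping)
```

With the default target, the most frequent value "Brooklyn" has no entry in the result. With the target "BK", which is not in the cluster, all three values are mapped.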
- class openclean.cluster.base.Clusterer
Bases:
openclean.operator.stream.processor.StreamProcessor
The value clusterer mixin class defines a single method clusters() to cluster a given list of values.
- abstract clusters(values: Union[Iterable[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]], collections.Counter]) List[openclean.cluster.base.Cluster]
Compute clusters for a given list of values. Each cluster is itself a collection of values, i.e., a subset of the values from the input list. The clusters method should be capable of taking either a list of values or a counter of distinct values.
- Parameters
values (iterable of values or collections.Counter) – Iterable of data values or a value counter that maps values to their frequencies.
- Return type
list of openclean.cluster.base.Cluster
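The requirement that clusters() accept either an iterable of values or a counter can be sketched as follows. The grouping rule (identical text after stripping whitespace and lower-casing) is a simple stand-in chosen for illustration only, and clusters_sketch is a hypothetical name. Plain Counter objects stand in for Cluster instances.

```python
from collections import Counter, defaultdict


def clusters_sketch(values):
    """Group values whose stripped, lower-cased text is identical."""
    # Accept a Counter as is; count the values of any other iterable.
    counts = values if isinstance(values, Counter) else Counter(values)
    groups = defaultdict(Counter)
    for value, count in counts.items():
        # Carry the frequency of each distinct value into its group.
        groups[str(value).strip().lower()][value] += count
    # Keep only groups that contain at least two distinct values.
    return [group for group in groups.values() if len(group) > 1]


from_list = clusters_sketch(["Paris", "paris ", "Paris", "Rome"])
from_counter = clusters_sketch(Counter({"Paris": 2, "paris ": 1, "Rome": 1}))
print(from_list)
```

Both calls produce the same result because the list input is reduced to frequency counts before any grouping takes place.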
- open(schema: List[Union[str, histore.document.schema.Column]]) openclean.operator.stream.consumer.StreamConsumer
Factory pattern for stream consumer.
Returns an instance of the stream clusterer that will collect the distinct values in the stream and then call the cluster method of this clusterer.
- Parameters
schema (list of string) – List of column names in the data stream schema.
- Return type
openclean.operator.stream.consumer.StreamConsumer
- class openclean.cluster.base.ONE
Bases:
object
Helper to simulate a counter where each value has a frequency of 1.
- class openclean.cluster.base.StreamClusterer(clusterer: openclean.cluster.base.Clusterer)
Bases:
openclean.operator.stream.consumer.StreamConsumer
Cluster values in a stream. This implementation will create a set of distinct values in the stream together with their frequency counts. It will then apply a given cluster algorithm on the created value set.
- close() List[openclean.cluster.base.Cluster]
Closing the consumer returns the result of applying the associated clusterer on the collected set of distinct values.
- Return type
list of openclean.cluster.base.Cluster
- consume(rowid: int, row: List)
Add the values in a given row to the internal counter.
If the row has only one value, this value will be used as the key for the counter. For rows with multiple values, the values in the row will be concatenated (separated by a blank space) into a single string value.
- Parameters
rowid (int) – Unique row identifier.
row (list) – List of values in the row.
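The key rule described for consume() can be sketched with a plain collections.Counter. The helper name row_key_sketch is hypothetical, and converting each value with str() before joining is an assumption; the source only states that the values are concatenated with a blank space.

```python
from collections import Counter


def row_key_sketch(row):
    """Hypothetical helper deriving a counter key from a stream row."""
    # A row with a single value uses that value itself as the key.
    if len(row) == 1:
        return row[0]
    # A row with several values is joined into one string, with the
    # values separated by a blank space.
    return " ".join(str(value) for value in row)


counter = Counter()
for row in [["Oslo"], ["Oslo"], ["New", "York"], [42]]:
    counter[row_key_sketch(row)] += 1
print(counter)
```

Note that in this sketch a single-value row keeps its original type (the integer 42 stays an integer), while multi-value rows always become strings.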