openclean.profiling.stats module

Collection of statistics helper functions and classes for profiling.

class openclean.profiling.stats.MinMaxCollector(first_value: Optional[Union[int, float, str, datetime.datetime]] = None, minmax: Optional[Tuple[Union[int, float, str, datetime.datetime], Union[int, float, str, datetime.datetime]]] = None)

Bases: dict

Consumer that identifies the minimum and maximum value over a stream of data. The class extends a dictionary for integration into profiling result dictionaries.

consume(value: Union[int, float, str, datetime.datetime])

Consume a value in the data stream and adjust the minimum and maximum if necessary.

Parameters

value (scalar) – Value in the data stream.

property maximum

Get the current maximum over all consumed values.

Return type

scalar

property minimum

Get the current minimum over all consumed values.

Return type

scalar
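The consume-and-track behavior described above can be sketched without importing openclean. This is a minimal illustration, not the library's implementation: the class name `MinMaxSketch` and the dictionary keys `'minimum'` and `'maximum'` are assumptions made for this example, and the handling of `first_value` and `minmax` is only a guess based on the constructor signature.

```python
import datetime
from typing import Optional, Tuple, Union

Scalar = Union[int, float, str, datetime.datetime]


class MinMaxSketch(dict):
    """Hypothetical stand-in that tracks the min and max of a value stream."""

    def __init__(
        self,
        first_value: Optional[Scalar] = None,
        minmax: Optional[Tuple[Scalar, Scalar]] = None,
    ):
        super().__init__()
        # Assumed semantics: an explicit (min, max) pair takes precedence
        # over a single seed value.
        if minmax is not None:
            self['minimum'], self['maximum'] = minmax
        elif first_value is not None:
            self['minimum'] = self['maximum'] = first_value

    def consume(self, value: Scalar) -> None:
        """Adjust the stored minimum and maximum if the value extends them."""
        if 'minimum' not in self:
            # First value seen: it is both the minimum and the maximum.
            self['minimum'] = self['maximum'] = value
        elif value < self['minimum']:
            self['minimum'] = value
        elif value > self['maximum']:
            self['maximum'] = value

    @property
    def minimum(self) -> Optional[Scalar]:
        return self.get('minimum')

    @property
    def maximum(self) -> Optional[Scalar]:
        return self.get('maximum')


collector = MinMaxSketch()
for v in [3, 7, 1, 5]:
    collector.consume(v)
print(collector.minimum, collector.maximum)
```

Because the sketch subclasses dict, the collected bounds can be merged into another result dictionary directly, which is the integration benefit the class description mentions. All values in one stream need to be mutually comparable; mixing, for example, strings and integers raises a TypeError in Python 3.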

openclean.profiling.stats.entropy(values: collections.Counter, default: Optional[float] = None) → float

Compute the entropy for a given set of distinct values and their frequency counts.

Returns the default value if the given counter is empty.

Parameters

values (collections.Counter) – Counter with frequencies for a set of distinct values.

default (float, default=None) – Value that is returned if the given counter is empty.

Return type

float
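As a rough illustration of the computation, the following sketch derives entropy from a frequency counter. It assumes Shannon entropy with a base-2 logarithm; the documentation above does not state the logarithm base, so treat that choice, and the function name `entropy_sketch`, as assumptions rather than the library's definition.

```python
import math
from collections import Counter
from typing import Optional


def entropy_sketch(values: Counter, default: Optional[float] = None) -> Optional[float]:
    """Hypothetical re-implementation: Shannon entropy (base 2) of a counter."""
    total = sum(values.values())
    if total == 0:
        # Mirrors the documented behavior: an empty counter yields the default.
        return default
    # Sum -p * log2(p) over the relative frequency p of each distinct value.
    return -sum(
        (count / total) * math.log2(count / total)
        for count in values.values()
        if count > 0
    )


balanced = entropy_sketch(Counter(['a', 'a', 'b', 'b']))
empty = entropy_sketch(Counter(), default=0.0)
print(balanced, empty)
```

Under the base-2 assumption, two equally frequent values carry one bit of entropy, and a counter with a single distinct value carries none.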