openclean.profiling.column module

Default profiler for columns in a dataset. Defines a profiler for columns in an in-memory dataset as well as for dataset streams.

The information collected by these profilers differs. The in-memory profiler can collect additional information (e.g., top-k values) that the stream profiler cannot.

class openclean.profiling.column.ColumnProfile(converter: openclean.profiling.datatype.convert.DatatypeConverter, values: Optional[collections.Counter] = None, top_k: Optional[int] = None)

Bases: dict

Dictionary of profiling results for the openclean column profiler.

consume(value: Union[int, float, str, datetime.datetime], count: int, distinct: Optional[bool] = False) → Union[int, float, str, datetime.datetime]

Consume a pair of (value, count) in the data stream. Values in the stream are not guaranteed to be unique and may be passed to this consumer multiple times (with multiple counts).

Returns the given value if it is not an empty value. Otherwise, the returned result is None.

Parameters

value (scalar) – Scalar column value from a dataset that is part of the data stream that is being profiled.
count (int) – Frequency of the value. Note that this count only relates to the given value and does not necessarily represent the total number of occurrences of the value in the stream.
distinct (bool, default=False) – Count distinct and total values for data types if this flag is True.

Return type

scalar

distinct(top_k: Optional[int] = None) → collections.Counter

Get a Counter object containing the most frequent values and their counts, as generated by the profiler.

Parameters: top_k (int, default=None) – Limit the number of elements in the returned Counter to the k most common values (if given). If None, the full set of values is returned.
Return type: collections.Counter
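
The effect of the top_k parameter can be illustrated with the standard library alone. The following is a minimal sketch, not the openclean implementation: the helper name top_k_counter is hypothetical, and it assumes that limiting the result is equivalent to keeping the entries returned by Counter.most_common.

```python
from collections import Counter


def top_k_counter(values, top_k=None):
    """Return a Counter with the top_k most common entries of `values`.

    Hypothetical helper for illustration only. If top_k is None, a copy of
    the full counter is returned.
    """
    if top_k is None:
        return Counter(values)
    # most_common(k) yields (value, count) pairs ordered by descending count.
    return Counter(dict(values.most_common(top_k)))


counts = Counter({"NY": 5, "CA": 3, "TX": 1})
limited = top_k_counter(counts, top_k=2)
full = top_k_counter(counts)
```

Here limited keeps only the two most frequent values, while full contains all three.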

class openclean.profiling.column.DefaultColumnProfiler(top_k: Optional[int] = 10, converter: Optional[openclean.profiling.datatype.convert.DatatypeConverter] = None)

Bases: openclean.profiling.base.DistinctSetProfiler

Default profiler for columns in a data frame. This profiler maintains a set of distinct values and includes the most frequent values in the returned result dictionary. It also extends the basic column profiler with data types for all values in the column.

The result schema for the returned dictionary is:

{
    "minmaxValues": smallest and largest not-None value for each data type in the stream,
    "emptyValueCount": number of empty values in the column,
    "totalValueCount": number of total values (including empty ones),
    "distinctValueCount": number of distinct values in the column,
    "entropy": entropy for distinct values in the column,
    "topValues": List of most frequent values in the column,
    "datatypes": Counter of type labels for all non-empty values
}

process(values: collections.Counter) → openclean.profiling.column.ColumnProfile

Compute profile for given counter of values in the column.

Parameters: values (collections.Counter) – Set of distinct scalar values or tuples of scalar values that are mapped to their respective frequency count.
Return type: dict
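
To make some of the statistics in the result schema concrete, here is a rough standard-library sketch of how they could be derived from a Counter of values. It is not the openclean code. The function name sketch_profile is hypothetical, and the sketch assumes that None and the empty string count as empty values and that entropy means Shannon entropy (base 2) over the relative frequencies of all entries in the counter.

```python
import math
from collections import Counter


def sketch_profile(values, top_k=10):
    """Compute a subset of the schema fields from a Counter (illustration only)."""
    total = sum(values.values())
    # Assumption: None and "" are the only empty values.
    empty = sum(c for v, c in values.items() if v is None or v == "")
    # Assumption: Shannon entropy in bits over relative frequencies.
    entropy = 0.0
    if total > 0:
        entropy = -sum(
            (c / total) * math.log2(c / total)
            for c in values.values()
            if c > 0
        )
    return {
        "emptyValueCount": empty,
        "totalValueCount": total,
        "distinctValueCount": len(values),
        "entropy": entropy,
        # most_common(None) returns all entries, mirroring top_k=None.
        "topValues": values.most_common(top_k),
    }


profile = sketch_profile(Counter({"a": 2, "b": 2}), top_k=1)
```

With two equally frequent values the entropy is one bit, and topValues holds a single (value, count) pair because top_k=1.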

class openclean.profiling.column.DefaultStreamProfiler(converter: Optional[openclean.profiling.datatype.convert.DatatypeConverter] = None)

Bases: openclean.profiling.base.DataStreamProfiler

Default profiler for columns in a data stream. This profiler does not maintain a set of distinct values because the size of the stream is unknown and keeping all values in an internal counter could require a large amount of memory.

Extends the basic column profiler with data types that are computed for each value in the stream as it arrives via the consumer method.

The result schema for the returned dictionary is:

{
    "minmaxValues": smallest and largest not-None value for each data type in the stream,
    "emptyValueCount": number of empty values in the stream,
    "totalValueCount": number of total values (including empty ones),
    "datatypes": Counter of type labels for all non-empty values
}

close() → openclean.profiling.column.ColumnProfile

Return the dictionary with collected statistics at the end of the data stream.

Return type: dict

consume(value: Union[int, float, str, datetime.datetime], count: int)

Consume a pair of (value, count) in the data stream. Values in the stream are not guaranteed to be unique and may be passed to this consumer multiple times (with multiple counts).

Parameters

value (scalar) – Scalar column value from a dataset that is part of the data stream that is being profiled.
count (int) – Frequency of the value. Note that this count only relates to the given value and does not necessarily represent the total number of occurrences of the value in the stream.

open(): Initialize the internal variables that maintain different parts of the generated profiling result.

class openclean.profiling.column.DistinctValueProfiler(converter: Optional[openclean.profiling.datatype.convert.DatatypeConverter] = None)

Bases: openclean.profiling.column.DefaultColumnProfiler

Column profiler that maintains the full list of distinct values in a column. This class is a simple wrapper for the openclean.profiling.column.DefaultColumnProfiler that sets top_k=None.