openclean.profiling.base module

Abstract base class for operators that perform data profiling on a sequence of data values.

Profilers can perform a wide range of tasks on a given sequence of values. Some profiling operators compute one or more features for all values in the sequence (e.g., frequency). Other profilers detect outliers in a sequence of values, that is, they filter values based on some condition computed over the value features. Profilers can also compute new values, for example, when discovering patterns in the data.

class openclean.profiling.base.DataProfiler

Bases: object

Profiler for a stream of (scalar) values. A data profiler computes statistics (informative summaries) over all values in a data stream, i.e., values from a single column or multiple columns in a dataset.

Data profilers are stream-aware so that an implementation of a profiler can be used on data frames as well as with streams over rows in a dataset.

Data is passed to the profiler either as pairs of (value, count) where count is a frequency count (using the methods open, consume, close) or as a Counter with distinct values and their absolute counts (using the process method). In the case of a stream of (value, count)-pairs, the values in the stream are not guaranteed to be unique, i.e., the same value may be passed to the profiler multiple times (with potentially different counts).

The profiler returns a dictionary or a list with the profiling results. The elements and structure of the result are implementation dependent.
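The stream protocol described above can be sketched without openclean installed. The class below is a hypothetical stand-in (ValueCountSketch is not part of the library, and it does not subclass DataProfiler) that follows the open, consume, close sequence and sums the counts of values that arrive more than once.

```python
from collections import Counter


class ValueCountSketch:
    """Hypothetical stand-in that mimics the open/consume/close protocol."""

    def open(self):
        # Reset internal state at the start of the stream.
        self.counts = Counter()

    def consume(self, value, count):
        # The same value may arrive several times, so counts are summed.
        self.counts[value] += count

    def close(self):
        # Return the profiling result as a dictionary.
        return {
            "distinct": len(self.counts),
            "total": sum(self.counts.values()),
        }


profiler = ValueCountSketch()
profiler.open()
# The value "a" is passed twice with different counts.
for value, count in [("a", 2), ("b", 1), ("a", 3)]:
    profiler.consume(value, count)
result = profiler.close()
```

Because the counts are accumulated per value, the result reports two distinct values and six occurrences in total, even though three pairs were consumed.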

abstract close() → Union[Dict, List]

Signal the end of the data stream. Returns the profiling result, which is a dictionary or a list. The elements and structure of the result are implementation dependent.

Return type

dict or list

abstract consume(value: Union[int, float, str, datetime.datetime], count: int)

Consume a pair of (value, count) in the data stream. Values in the stream are not guaranteed to be unique and may be passed to this consumer multiple times (with potentially different counts).

Parameters
  • value (scalar) – Scalar column value from a dataset that is part of the data stream that is being profiled.

  • count (int) – Frequency of the value. Note that this count only relates to the given value and does not necessarily represent the total number of occurrences of the value in the stream.

abstract open()

Signal the start of the data stream. This method can be used by implementations of the scalar profiler to initialize internal variables.

abstract process(values: collections.Counter) → Union[Dict, List]

Compute one or more features over a set of distinct values. This is the main profiling function that computes statistics or informative summaries over the given data values. It operates on a compact form of a value list that only contains the distinct values and their frequency counts.

This function returns a dictionary or a list. The elements and structure of the result are implementation dependent.

Parameters

values (collections.Counter) – Set of distinct scalar values or tuples of scalar values that are mapped to their respective frequency count.

Return type

dict or list

run(df: pandas.core.frame.DataFrame, columns: Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction, List[Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction]]]) → Union[Dict, List]

Run the profiler using values that are generated from one or more columns (producers) for a given data frame. Evaluates the producers and creates a value count that is passed on to the process method for profiling.

Parameters
  • df (pd.DataFrame) – Input data frame.

  • columns (int, string, list, or openclean.function.eval.base.EvalFunction) – Evaluation function to extract values from data frame rows. This can also be a single column reference or a list of column references.

Return type

dict or list
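The value count that run passes to process can be pictured with a small sketch. This is only an illustration under assumptions: a plain list of row dictionaries stands in for the data frame, and value_counts is a hypothetical helper, not an openclean function. A single column yields scalar values and a list of columns yields tuples of scalar values.

```python
from collections import Counter


def value_counts(rows, columns):
    """Hypothetical helper: count values (or value tuples) per column(s)."""
    if isinstance(columns, list):
        # Multiple columns yield tuples of scalar values.
        return Counter(tuple(row[c] for c in columns) for row in rows)
    # A single column reference yields scalar values.
    return Counter(row[columns] for row in rows)


# A list of dictionaries stands in for the input data frame.
rows = [
    {"city": "NY", "state": "NY"},
    {"city": "Boston", "state": "MA"},
    {"city": "NY", "state": "NY"},
]
single = value_counts(rows, "city")
multi = value_counts(rows, ["city", "state"])
```

Either counter has the shape that the process method expects: distinct values mapped to their frequency counts.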

class openclean.profiling.base.DataStreamProfiler

Bases: openclean.profiling.base.DataProfiler

Data stream profiler that implements the process method of the profiler function using the stream methods consume and close.

process(values: collections.Counter) → Union[Dict, List]

Compute one or more features over a set of distinct values. Streams the elements in the given counter to the consume method.

Parameters

values (collections.Counter) – Set of distinct scalar values or tuples of scalar values that are mapped to their respective frequency count.

Return type

dict or list
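One way to picture how process can be expressed through the stream methods is the sketch below. It is not the openclean implementation: StreamSketch is a hypothetical stand-in, and calling open before streaming the pairs is an assumption.

```python
from collections import Counter


class StreamSketch:
    """Hypothetical stand-in; not the openclean implementation."""

    def open(self):
        # Reset state at the start of the stream.
        self.longest = None

    def consume(self, value, count):
        # Track the longest string representation seen so far.
        text = str(value)
        if self.longest is None or len(text) > len(self.longest):
            self.longest = text

    def close(self):
        # Return the profiling result as a dictionary.
        return {"longest": self.longest}

    def process(self, values):
        # Stream every (value, count) pair of the counter through consume.
        self.open()  # assumption: the stream is opened first
        for value, count in values.items():
            self.consume(value, count)
        return self.close()


stream_result = StreamSketch().process(Counter(["NY", "Boston", "NY"]))
```

With this arrangement a subclass only needs to provide the stream methods, and the same logic serves both streamed input and a precomputed counter.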

class openclean.profiling.base.DistinctSetProfiler

Bases: openclean.profiling.base.DataProfiler

Profiling function that collects all elements in the stream and then uses the process method to compute the profiling result.

close() → Union[Dict, List]

Signal the end of the data stream. Returns the profiling result, which is a dictionary or a list. The elements and structure of the result are implementation dependent.

Return type

dict or list

consume(value: Union[int, float, str, datetime.datetime], count: int)

Consume a pair of (value, count) in the data stream. Collects all values in a counter dictionary.

Parameters
  • value (scalar) – Scalar column value from a dataset that is part of the data stream that is being profiled.

  • count (int) – Frequency of the value. Note that this count only relates to the given value and does not necessarily represent the total number of occurrences of the value in the stream.

open()

Initialize the counter at the beginning of the stream.
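The collect-then-process pattern of this class can be sketched as follows. DistinctSetSketch is a hypothetical stand-in rather than the library class, and the ranking computed in process is an arbitrary example feature.

```python
from collections import Counter


class DistinctSetSketch:
    """Hypothetical stand-in; not the openclean implementation."""

    def open(self):
        # Initialize the counter at the beginning of the stream.
        self.counter = Counter()

    def consume(self, value, count):
        # Collect values; repeated values accumulate their counts.
        self.counter[value] += count

    def close(self):
        # Hand the collected distinct values to process.
        return self.process(self.counter)

    def process(self, values):
        # Example feature: values ordered by decreasing frequency.
        return [value for value, _ in values.most_common()]


sketch = DistinctSetSketch()
sketch.open()
for value, count in [("x", 1), ("y", 4), ("x", 1)]:
    sketch.consume(value, count)
ranking = sketch.close()
```

Here the result is a list rather than a dictionary, and the whole set of distinct values is held in memory until close is called.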