openclean.function.eval.aggregate module

Collection of evaluation functions that return a computed statistic over one or more data frame columns for all data frame rows.
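The phrase "for all data frame rows" can be read as: the statistic is computed once over the column and the same result is then returned for every row. The following is a minimal sketch of that idea in plain Python, not the library's implementation. The `Frame` alias and the helper `aggregate_for_all_rows` are hypothetical names introduced only for illustration.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical stand-in for a data frame: column name -> list of row values.
Frame = Dict[str, List[float]]


def aggregate_for_all_rows(frame: Frame, column: str, func: Callable) -> List[float]:
    """Compute func once over the whole column and repeat the result per row."""
    values = frame[column]
    result = func(values)
    return [result] * len(values)


frame = {"A": [1.0, 2.0, 3.0, 6.0]}

# Each list has one entry per row; all entries hold the same statistic.
avg_per_row = aggregate_for_all_rows(frame, "A", mean)
max_per_row = aggregate_for_all_rows(frame, "A", max)
```

Under this reading, Avg, Max, Min, and Sum differ only in the function that is applied to the column values.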

class openclean.function.eval.aggregate.Avg(columns: Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction, List[Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction]]])

Bases: openclean.function.eval.base.Eval

Evaluation function that returns the mean of values for one or more columns in a data frame.

class openclean.function.eval.aggregate.ColumnAggregator(func: Callable)

Bases: openclean.function.value.base.ValueFunction

Value function that computes an aggregate over a list of values. The aggregated value is computed when the function is prepared. Preparing the function returns a constant value function that is initialized with the aggregation result, i.e., one that returns the aggregation result for any input value.

eval(value: Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]])

Raises an error. The column aggregator can only be used to prepare a constant value function.

Parameters

value (scalar or tuple) – Value from the list that was used to prepare the function.

Raises

NotImplementedError

is_prepared() → bool

Signals that the column aggregator has to be prepared before use; it cannot evaluate values directly.

Return type

bool

prepare(values: List[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]]) → openclean.function.value.base.ConstantValue

Optional step to prepare the function for a given set of values. This step allows additional statistics to be computed over the set of values.

While it is likely that the given set of values represents the values for which the eval() function will be called, this property is not guaranteed.

Parameters

values (list) – List of scalar values or tuples of scalar values over which the aggregate is computed.

Return type

openclean.function.value.base.ConstantValue
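The prepare-then-evaluate protocol described for ColumnAggregator can be sketched without the library as follows. The classes `ConstantSketch` and `ColumnAggregatorSketch` are hypothetical stand-ins, not the openclean classes, and returning `False` from `is_prepared()` is an assumption based on the statement that the aggregator has to be prepared.

```python
from typing import Any, Callable, List


class ConstantSketch:
    """Stand-in for a constant value function: ignores its input."""

    def __init__(self, value: Any):
        self.value = value

    def eval(self, value: Any) -> Any:
        return self.value


class ColumnAggregatorSketch:
    """Stand-in for the aggregator: usable only via prepare()."""

    def __init__(self, func: Callable[[List[Any]], Any]):
        self.func = func

    def is_prepared(self) -> bool:
        # Assumption: the aggregator always reports that it needs preparation.
        return False

    def eval(self, value: Any) -> Any:
        # Evaluating a single value is not meaningful for an aggregate.
        raise NotImplementedError()

    def prepare(self, values: List[Any]) -> ConstantSketch:
        # Aggregate once, then wrap the result in a constant function.
        return ConstantSketch(self.func(values))


aggregator = ColumnAggregatorSketch(sum)
constant = aggregator.prepare([1, 2, 3])
results = [constant.eval(v) for v in [1, 2, 3]]
```

The design keeps the aggregation cost in `prepare()`: after that step, every call to `eval()` on the returned object is a constant lookup.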

class openclean.function.eval.aggregate.Count(columns: Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction, List[Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction]]], value: Optional[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]] = True)

Bases: openclean.function.eval.base.Eval

Evaluation function that counts the number of values in one or more columns that match a given value.
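As a rough illustration of counting matches, the sketch below counts the values in a list that are equal to a given value, with `True` as the default to mirror the signature above. The helper `count_matches` is a hypothetical name, and equality comparison is an assumption about how matching is decided.

```python
from typing import Any, List


def count_matches(values: List[Any], value: Any = True) -> int:
    """Count the entries in values that compare equal to value."""
    return sum(1 for v in values if v == value)


flags = [True, False, True, True]
cities = ["NYC", "LA", "NYC"]

# With the default, the count is the number of True entries.
n_true = count_matches(flags)
# With an explicit value, the count is the number of equal entries.
n_nyc = count_matches(cities, "NYC")
```

A default of `True` fits the case where the counted column is itself the result of a Boolean predicate, so that the count is the number of rows satisfying it.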

class openclean.function.eval.aggregate.Max(columns: Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction, List[Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction]]])

Bases: openclean.function.eval.base.Eval

Evaluation function that returns the maximum of values for one or more columns in a data frame.

class openclean.function.eval.aggregate.Min(columns: Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction, List[Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction]]])

Bases: openclean.function.eval.base.Eval

Evaluation function that returns the minimum of values for one or more columns in a data frame.

class openclean.function.eval.aggregate.Sum(columns: Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction, List[Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction]]])

Bases: openclean.function.eval.base.Eval

Evaluation function that returns the sum over values for one or more columns in a data frame.