openclean.function.eval.base module

Base classes for data frame manipulating functions. Evaluation functions are applied to one or more columns in a data frame. Functions are expected to return either a data series or a list of scalar values.

class openclean.function.eval.base.Add(lhs: Union[int, float, str, datetime.datetime, openclean.function.eval.base.EvalFunction], rhs: Union[int, float, str, datetime.datetime, openclean.function.eval.base.EvalFunction])

Bases: openclean.function.eval.base.BinaryOperator

Arithmetic ‘+’ operator.

class openclean.function.eval.base.BinaryOperator(lhs: Union[int, float, str, datetime.datetime, openclean.function.eval.base.EvalFunction], rhs: Union[int, float, str, datetime.datetime, openclean.function.eval.base.EvalFunction], op: Callable)

Bases: openclean.function.eval.base.EvalFunction

Generic operator for comparing or transforming two column value expressions (that are represented as evaluation functions).

eval(df: pandas.core.frame.DataFrame) Union[pandas.core.series.Series, List[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]]]

Evaluate the binary operator on a given data frame. The result is either a single data series or a list of scalar values.

Parameters

df (pd.DataFrame) – Pandas data frame.

Return type

pd.Series or list

prepare(columns: List[Union[str, histore.document.schema.Column]]) Callable[[List[Union[int, float, str, datetime.datetime]]], Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]]

Prepare both evaluation functions (lhs and rhs) and return a binary operator stream function.

Parameters

columns (list of string) – Schema for data stream rows.

Return type

openclean.data.stream.base.StreamFunction

class openclean.function.eval.base.BinaryStreamFunction(lhs: Callable[[List[Union[int, float, str, datetime.datetime]]], Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]], rhs: Callable[[List[Union[int, float, str, datetime.datetime]]], Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]], op: Callable)

Bases: object

Binary operator for data streams. Evaluates a given binary function on the result of two stream functions.
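The composition described here can be sketched in plain Python. The class below is a hypothetical stand-in, not openclean's implementation: it assumes that each stream function is a callable taking one row (a list of values) and that op is a two-argument callable such as operator.mul. The row layout is made up for illustration.

```python
import operator
from typing import Callable, List


class BinaryStreamSketch:
    """Hypothetical stand-in for a binary stream function."""

    def __init__(self, lhs: Callable, rhs: Callable, op: Callable):
        self.lhs = lhs
        self.rhs = rhs
        self.op = op

    def __call__(self, row: List):
        # Evaluate both operands on the same row, then combine the results.
        return self.op(self.lhs(row), self.rhs(row))


# Assumed row layout: [name, price, quantity].
total = BinaryStreamSketch(lambda row: row[1], lambda row: row[2], operator.mul)
result = total(["apple", 3, 4])
```

Because both operands are themselves row-level callables, such functions can be nested to build larger expressions.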

class openclean.function.eval.base.Col(column: Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction], colidx: Optional[int] = None)

Bases: openclean.function.eval.base.EvalFunction

Evaluation function that returns the value from a single column in a data frame row. Extends the abstract evaluation function and implements the stream function interface. For a stream function the internal _colidx has to be defined (given at object construction).

eval(df: pandas.core.frame.DataFrame) Union[pandas.core.series.Series, List[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]]]

Get the values from the data frame column that is referenced by this function.

Parameters

df (pd.DataFrame) – Pandas data frame.

Return type

pd.Series

prepare(columns: List[Union[str, histore.document.schema.Column]]) Callable[[List[Union[int, float, str, datetime.datetime]]], Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]]

Return a Col function that is prepared, i.e., that has the column index for the column that it operates on initialized.

Parameters

columns (list of string) – List of column names in the schema of the data stream.

Return type

openclean.data.stream.base.StreamFunction
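What "prepared" means for a column function can be illustrated with a small sketch. It is not openclean's code: prepare_col is a hypothetical helper that handles only string column names, whereas the signature above also accepts integer indexes and Column objects.

```python
from typing import Callable, List


def prepare_col(column: str, columns: List[str]) -> Callable:
    """Resolve the column name once and return a row-level extractor."""
    colidx = columns.index(column)  # raises ValueError for unknown names

    def stream_function(row: List):
        # The lookup cost was paid in prepare_col; per row this is one index access.
        return row[colidx]

    return stream_function


get_city = prepare_col("city", ["name", "city", "zip"])
value = get_city(["Alice", "Berlin", "10115"])
```

Resolving the index before streaming starts keeps the per-row work to a single list access.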

class openclean.function.eval.base.Cols(columns: Union[int, str, List[Union[str, int]]], colidxs: Optional[List[int]] = None)

Bases: openclean.function.eval.base.EvalFunction

Evaluation function that returns a tuple of values from one or more column(s) in the data frame row. Extends the abstract evaluation function and implements the stream function interface. For a stream function the internal _colidxs have to be defined (given at object construction).

eval(df: pandas.core.frame.DataFrame) Union[pandas.core.series.Series, List[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]]]

Get the values from the data frame columns that are referenced by this function. Returns a list of tuples with one value for each of the referenced columns.

Parameters

df (pd.DataFrame) – Pandas data frame.

Return type

list

prepare(columns: List[Union[str, histore.document.schema.Column]]) Callable[[List[Union[int, float, str, datetime.datetime]]], Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]]

Return a Cols function that is prepared, i.e., that has the column indexes for the columns that it operates on initialized.

Parameters

columns (list of string) – List of column names in the schema of the data stream.

Return type

openclean.data.stream.base.StreamFunction

class openclean.function.eval.base.Const(value: Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]])

Bases: openclean.function.eval.base.EvalFunction

Evaluation function that returns a constant value for each data frame row. Extends the abstract evaluation function and implements the stream function interface.

eval(df: pandas.core.frame.DataFrame) List[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]]

Execute method for the evaluation function. Returns a list whose length equals the number of rows in the data frame, with the defined constant value at every position.

Parameters

df (pd.DataFrame) – Pandas data frame.

Return type

list

prepare(columns: List[Union[str, histore.document.schema.Column]]) Callable[[List[Union[int, float, str, datetime.datetime]]], Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]]

The prepare method returns a callable that returns the constant value for every input row.

Parameters

columns (list of string) – List of column names in the schema of the data stream.

Return type

openclean.data.stream.base.StreamFunction
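The two behaviours of a constant function can be sketched as follows. ConstSketch is a hypothetical stand-in, and the data frame is modelled as a plain list of rows so that the example needs no third-party packages.

```python
from typing import Any, Callable, List


class ConstSketch:
    """Hypothetical stand-in mirroring the two methods described above."""

    def __init__(self, value: Any):
        self.value = value

    def eval(self, rows: List[List]) -> List:
        # One copy of the constant per input row.
        return [self.value] * len(rows)

    def prepare(self, columns: List[str]) -> Callable:
        # The schema is ignored; every row maps to the same constant.
        return lambda row: self.value


const = ConstSketch(1)
rows = [["a", 10], ["b", 20], ["c", 30]]
batch = const.eval(rows)
per_row = const.prepare(["key", "count"])(rows[0])
```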

class openclean.function.eval.base.Divide(lhs: Union[int, float, str, datetime.datetime, openclean.function.eval.base.EvalFunction], rhs: Union[int, float, str, datetime.datetime, openclean.function.eval.base.EvalFunction])

Bases: openclean.function.eval.base.BinaryOperator

Arithmetic ‘/’ operator.

class openclean.function.eval.base.Eq(lhs: Union[int, float, str, datetime.datetime, openclean.function.eval.base.EvalFunction], rhs: Union[int, float, str, datetime.datetime, openclean.function.eval.base.EvalFunction])

Bases: openclean.function.eval.base.BinaryOperator

Binary equality comparison predicate.

class openclean.function.eval.base.Eval(columns: Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction, List[Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction]]], func: Union[Callable, openclean.function.value.base.ValueFunction], args: Optional[Dict] = None, is_unary: Optional[bool] = None)

Bases: openclean.function.eval.base.EvalFunction

Eval is a factory for evaluation functions that extract values from one or more columns in data frame rows and that evaluate a given function (consumer) on the extracted values.

We distinguish between unary evaluation functions that extract values from a single column and ternary evaluation functions that extract values from two or more columns. For the consumer we also distinguish between unary and ternary functions.

The arity of an evaluation function is determined by the number of input columns that are specified when calling the Eval factory. The arity of the consumer cannot be determined automatically but has to be specified by the user in the is_unary parameter.

A ternary evaluation function with a unary consumer will pass a tuple with the extracted values to the consumer. A unary evaluation function with a ternary consumer will raise a TypeError in the constructor.
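The arity rules above can be sketched as a small check. This is an illustration under assumptions, not the library's constructor: check_arity is a hypothetical helper, a single-element list is treated like a single column, and is_unary=None is taken to mean that the user made no declaration.

```python
from typing import List, Optional, Union


def check_arity(columns: Union[str, List[str]], is_unary: Optional[bool] = None) -> str:
    """Classify an evaluation function by its number of input columns."""
    names = columns if isinstance(columns, list) else [columns]
    if len(names) == 1:
        if is_unary is False:
            # One column cannot feed a consumer that expects several arguments.
            raise TypeError("a ternary consumer needs more than one input column")
        return "unary"
    return "ternary"


single = check_arity("price")
multi = check_arity(["price", "quantity"], is_unary=True)

try:
    check_arity("price", is_unary=False)
    raised = False
except TypeError:
    raised = True
```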

decorate(func)

Decorate the given function with the optional keyword arguments that were passed to the Eval constructor, if any.

Parameters

func (callable) – Function that is being decorated.

Return type

callable
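One common way to bind keyword arguments to a function is functools.partial. The sketch below assumes that approach; the source does not state how decorate is implemented, and the pad function and its arguments are made up for illustration.

```python
from functools import partial
from typing import Callable, Dict, Optional


def decorate(func: Callable, args: Optional[Dict] = None) -> Callable:
    """Bind optional keyword arguments to func; return func unchanged if none."""
    if args:
        return partial(func, **args)
    return func


def pad(value, width=0, fill=" "):
    """Right-align the string form of value in a field of the given width."""
    return str(value).rjust(width, fill)


pad_zip = decorate(pad, {"width": 5, "fill": "0"})
padded = pad_zip(123)
unchanged = decorate(pad)
```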

eval(df: pandas.core.frame.DataFrame) Union[pandas.core.series.Series, List[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]]]

Evaluate the consumer on the lists of values that are generated by the referenced columns.

Parameters

df (pd.DataFrame) – Pandas data frame.

Return type

pd.Series or list

prepare(columns: List[Union[str, histore.document.schema.Column]]) Callable[[List[Union[int, float, str, datetime.datetime]]], Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]]

Create a stream function that applies the consumer on the results from one or more stream functions for the producers.

Parameters

columns (list of string) – List of column names in the schema of the data stream.

Return type

openclean.data.stream.base.StreamFunction

class openclean.function.eval.base.EvalFunction

Bases: object

Evaluation functions are used to compute results over rows in a data frame or a data stream. Conceptually, evaluation functions are evaluated over one or more columns for each row in the input data. For each row, the function is expected to generate one or more (transformed) values for the column or columns on which it operates.

Evaluation functions are building blocks for data frame operators as well as data stream pipelines. Each of these two use cases is supported by a different (abstract) method:

  • eval: The eval function is used by data frame operators. The function receives the full data frame as an argument. It returns a data series (or list) of values with one value for each row in the input data frame. Functions that operate over multiple columns will return a list of tuples.

  • prepare: If an evaluation function is used as part of a data stream operator, the function needs to be prepared. That is, the function needs to know the schema of the rows in the data stream before streaming starts. The prepare method receives the schema of the data stream as an argument. It returns a callable that accepts a data stream row as its only argument and returns a single value or a tuple of values, depending on whether the evaluation function operates on one or more columns.

abstract eval(df: pandas.core.frame.DataFrame) Union[pandas.core.series.Series, List[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]]]

Evaluate the function on a given data frame. The result is either a data series or a list of values. The resulting data contains one output value per input row. If the evaluation function operates over multiple columns then the result will be a list of tuples with the size of each tuple matching the number of columns the function operates on.

Parameters

df (pd.DataFrame) – Pandas data frame.

Return type

pd.Series or list

abstract prepare(columns: List[Union[str, histore.document.schema.Column]]) Callable[[List[Union[int, float, str, datetime.datetime]]], Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]]

Prepare the evaluation function to be able to process rows in a data stream. This method is called before streaming starts to inform the function about the schema of the rows in the data stream.

Prepare is expected to return a callable that accepts a single data stream row as input and that returns a single value (if the function operates on a single column) or a tuple of values (for functions that operate over multiple columns).

Parameters

columns (list of string) – List of column names in the schema of the data stream.

Return type

openclean.data.stream.base.StreamFunction
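The contract of the two abstract methods described above can be sketched with a small stand-alone class hierarchy. All names here (Frame, EvalSketch, Upper) are hypothetical and only mirror the documented behaviour. The data frame is modelled as a schema plus a list of rows, and reusing prepare inside eval is a design choice of the sketch, not a statement about openclean.

```python
from abc import ABC, abstractmethod
from typing import Callable, List, NamedTuple


class Frame(NamedTuple):
    """Hypothetical stand-in for a data frame: a schema plus rows."""
    columns: List[str]
    rows: List[List]


class EvalSketch(ABC):
    @abstractmethod
    def eval(self, df: Frame) -> List:
        """Return one value per row of df."""

    @abstractmethod
    def prepare(self, columns: List[str]) -> Callable:
        """Return a callable that maps a single stream row to a value."""


class Upper(EvalSketch):
    """Upper-case the values of one column."""

    def __init__(self, column: str):
        self.column = column

    def prepare(self, columns: List[str]) -> Callable:
        colidx = columns.index(self.column)
        return lambda row: row[colidx].upper()

    def eval(self, df: Frame) -> List:
        # Reuse the row-level function so that both code paths agree.
        f = self.prepare(df.columns)
        return [f(row) for row in df.rows]


df = Frame(["name", "city"], [["alice", "berlin"], ["bob", "paris"]])
func = Upper("city")
batch = func.eval(df)
streamed = [func.prepare(df.columns)(row) for row in df.rows]
```

Both paths produce one output value per input row; only the point at which the schema becomes known differs.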

class openclean.function.eval.base.FloorDivide(lhs: Union[int, float, str, datetime.datetime, openclean.function.eval.base.EvalFunction], rhs: Union[int, float, str, datetime.datetime, openclean.function.eval.base.EvalFunction])

Bases: openclean.function.eval.base.BinaryOperator

Arithmetic ‘//’ operator.

class openclean.function.eval.base.Geq(lhs: Union[int, float, str, datetime.datetime, openclean.function.eval.base.EvalFunction], rhs: Union[int, float, str, datetime.datetime, openclean.function.eval.base.EvalFunction])

Bases: openclean.function.eval.base.BinaryOperator

Predicate for ‘>=’ comparison.

class openclean.function.eval.base.Gt(lhs: Union[int, float, str, datetime.datetime, openclean.function.eval.base.EvalFunction], rhs: Union[int, float, str, datetime.datetime, openclean.function.eval.base.EvalFunction])

Bases: openclean.function.eval.base.BinaryOperator

Predicate for ‘>’ comparison.

class openclean.function.eval.base.Leq(lhs: Union[int, float, str, datetime.datetime, openclean.function.eval.base.EvalFunction], rhs: Union[int, float, str, datetime.datetime, openclean.function.eval.base.EvalFunction])

Bases: openclean.function.eval.base.BinaryOperator

Predicate for ‘<=’ comparison.

class openclean.function.eval.base.Lt(lhs: Union[int, float, str, datetime.datetime, openclean.function.eval.base.EvalFunction], rhs: Union[int, float, str, datetime.datetime, openclean.function.eval.base.EvalFunction])

Bases: openclean.function.eval.base.BinaryOperator

Predicate for ‘<’ comparison.

class openclean.function.eval.base.Multiply(lhs: Union[int, float, str, datetime.datetime, openclean.function.eval.base.EvalFunction], rhs: Union[int, float, str, datetime.datetime, openclean.function.eval.base.EvalFunction])

Bases: openclean.function.eval.base.BinaryOperator

Arithmetic ‘*’ operator.

class openclean.function.eval.base.Neq(lhs: Union[int, float, str, datetime.datetime, openclean.function.eval.base.EvalFunction], rhs: Union[int, float, str, datetime.datetime, openclean.function.eval.base.EvalFunction])

Bases: openclean.function.eval.base.BinaryOperator

Predicate for ‘!=’ comparison.

class openclean.function.eval.base.Pow(lhs: Union[int, float, str, datetime.datetime, openclean.function.eval.base.EvalFunction], rhs: Union[int, float, str, datetime.datetime, openclean.function.eval.base.EvalFunction])

Bases: openclean.function.eval.base.BinaryOperator

Arithmetic ‘**’ operator.

class openclean.function.eval.base.Subtract(lhs: Union[int, float, str, datetime.datetime, openclean.function.eval.base.EvalFunction], rhs: Union[int, float, str, datetime.datetime, openclean.function.eval.base.EvalFunction])

Bases: openclean.function.eval.base.BinaryOperator

Arithmetic ‘-’ operator.

class openclean.function.eval.base.TernaryStreamFunction(producers: Callable[[List[Union[int, float, str, datetime.datetime]]], Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]], consumer: Callable, is_unary: Optional[bool] = False)

Bases: object

A ternary stream function extracts values using multiple producers and passes them to a single consumer. The consumer may either be a unary or a ternary function. A unary function will receive a tuple of extracted values as the argument.
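A minimal sketch of this dispatch, under two assumptions that the text does not confirm: producers is a list of row-level callables, and a non-unary consumer receives the extracted values as separate positional arguments. TernaryStreamSketch is a hypothetical name.

```python
from typing import Callable, List


class TernaryStreamSketch:
    """Hypothetical stand-in for a ternary stream function."""

    def __init__(self, producers: List[Callable], consumer: Callable, is_unary: bool = False):
        self.producers = producers
        self.consumer = consumer
        self.is_unary = is_unary

    def __call__(self, row: List):
        # Extract one value per producer from the same row.
        values = tuple(p(row) for p in self.producers)
        if self.is_unary:
            # Unary consumer: all values arrive as one tuple.
            return self.consumer(values)
        # Assumed behaviour: each value becomes its own argument.
        return self.consumer(*values)


producers = [lambda row: row[0], lambda row: row[1]]
row = ["Jane", "Doe"]
joined = TernaryStreamSketch(producers, " ".join, is_unary=True)(row)
concat = TernaryStreamSketch(producers, lambda a, b: a + b)(row)
```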

class openclean.function.eval.base.UnaryStreamFunction(producer: Callable[[List[Union[int, float, str, datetime.datetime]]], Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]], consumer: Callable)

Bases: object

Unary operator for data streams. Evaluates a given unary function on the result of another stream function.

openclean.function.eval.base.evaluate(df: pandas.core.frame.DataFrame, producers: List[openclean.function.eval.base.EvalFunction]) Union[pandas.core.series.Series, List[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]]]

Helper method to extract a list of values (i.e., an evaluation result) from a data frame using one or more producers (evaluation functions).

Results are generated by evaluating the given producers individually. If a single producer is given, the result from that producer is returned. If multiple producers are given, a list of tuples is returned, with one tuple per row that holds the result from each producer.

Parameters
  • df (pd.DataFrame) – Pandas data frame.

  • producers (list of openclean.function.eval.base.EvalFunction) – List of evaluation functions that are used as data (series) producers.

Return type

pd.Series or list
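The combination rule can be sketched without pandas. In this hypothetical version the data frame is a list of rows and each producer is a callable that maps the whole table to a list of values; evaluate_sketch and column are illustrative names, not part of the library.

```python
from typing import Callable, List


def evaluate_sketch(rows: List[List], producers: List[Callable]) -> List:
    """Run each producer over the table and combine the results."""
    results = [producer(rows) for producer in producers]
    if len(results) == 1:
        # Single producer: return its result unchanged.
        return results[0]
    # Multiple producers: one tuple per row, one entry per producer.
    return list(zip(*results))


def column(idx: int) -> Callable:
    """Hypothetical producer that extracts the column at position idx."""
    return lambda rows: [row[idx] for row in rows]


rows = [["a", 1], ["b", 2]]
single = evaluate_sketch(rows, [column(0)])
multi = evaluate_sketch(rows, [column(0), column(1)])
```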

openclean.function.eval.base.to_column_eval(value: Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction]) openclean.function.eval.base.EvalFunction

Convert a value into an evaluation function. If the value is not already an evaluation function, a column evaluation function is returned.

Parameters

value (string, int, or openclean.function.eval.base.EvalFunction) – Value that is converted to an evaluation function.

Return type

openclean.function.eval.base.EvalFunction

openclean.function.eval.base.to_const_eval(value)

Ensure that the value is an evaluation function. If the given argument is not an evaluation function the value is wrapped as a constant value.

Parameters

value (openclean.function.eval.base.EvalFunction or scalar) – Value that is represented as an evaluation function.

Return type

openclean.function.eval.base.EvalFunction
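The difference between this function and to_column_eval lies only in how a plain value is wrapped. The sketch below uses hypothetical stand-in classes to show that the same string becomes a column reference in one case and a constant in the other, while an existing evaluation function passes through unchanged.

```python
from typing import Any


class EvalSketch:
    """Hypothetical marker base class for evaluation functions."""


class ColSketch(EvalSketch):
    def __init__(self, column: Any):
        self.column = column


class ConstSketch(EvalSketch):
    def __init__(self, value: Any):
        self.value = value


def to_column_eval_sketch(value: Any) -> EvalSketch:
    # Plain values are interpreted as column references.
    return value if isinstance(value, EvalSketch) else ColSketch(value)


def to_const_eval_sketch(value: Any) -> EvalSketch:
    # Plain values are interpreted as constants.
    return value if isinstance(value, EvalSketch) else ConstSketch(value)


as_column = to_column_eval_sketch("price")
as_const = to_const_eval_sketch("price")
passthrough = to_const_eval_sketch(as_column)
```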

openclean.function.eval.base.to_eval(producers: typing.Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction, float, datetime.datetime, typing.List[typing.Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction, float, datetime.datetime]]], factory: typing.Optional[typing.Callable] = <function to_column_eval>) List[openclean.function.eval.base.EvalFunction]

Convert a single input column or a list of input columns into a list of evaluation functions. The optional factory function is used to create instances of an evaluation function for scalar argument values.

Parameters
  • producers (int, string, EvalFunction, or list) – Specification of one or more input producers for an evaluation function.

  • factory (callable) – Factory for evaluation functions that can be instantiated using a single scalar argument (e.g., Col or Const).

Return type

list
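The normalisation step can be sketched as follows. The names are hypothetical, and the sketch assumes that values which are already evaluation functions are kept as they are while all other values go through the factory.

```python
from typing import Any, Callable, List


class EvalSketch:
    """Hypothetical stand-in for an evaluation function."""

    def __init__(self, source: Any):
        self.source = source


def to_eval_sketch(producers: Any, factory: Callable = EvalSketch) -> List[EvalSketch]:
    """Wrap a single value or a list of values as evaluation functions."""
    items = producers if isinstance(producers, list) else [producers]
    # Existing evaluation functions are kept; scalars go through the factory.
    return [p if isinstance(p, EvalSketch) else factory(p) for p in items]


existing = EvalSketch("price")
single = to_eval_sketch("name")
mixed = to_eval_sketch(["name", 2, existing])
```

The result is always a list, so callers can treat the single-column and multi-column cases uniformly.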