openclean.operator.transform.update module

Data frame transformation operator that updates values in columns of a data frame.

class openclean.operator.transform.update.Update(columns: Union[int, str, List[Union[str, int]]], func: Union[Callable, Dict, openclean.function.eval.base.EvalFunction, int, float, str, datetime.datetime, openclean.function.value.base.ValueFunction])

Bases: openclean.operator.stream.processor.StreamProcessor, openclean.operator.base.DataFrameTransformer

Data frame transformer that updates values in data frame column(s) using a given update function. The function is executed for each row and the resulting values replace the original cell values in the row for all listed columns (in their order of appearance in the columns list).

open(schema: List[Union[str, histore.document.schema.Column]]) openclean.operator.stream.consumer.StreamFunctionHandler

Factory pattern for stream consumer. Returns an instance of a stream consumer that updates values in a data stream row.

Parameters

schema (list of string) – List of column names in the data stream schema.

Return type

openclean.operator.stream.consumer.StreamFunctionHandler

transform(df: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame

Modify rows in the given data frame. Returns a modified data frame where values have been updated by the results of evaluating the associated row update function.

Parameters

df (pandas.DataFrame) – Input data frame.

Return type

pandas.DataFrame

openclean.operator.transform.update.get_update_function(func: Union[Callable, Dict, openclean.function.eval.base.EvalFunction, int, float, str, datetime.datetime, openclean.function.value.base.ValueFunction], columns: Union[int, str, List[Union[str, int]]]) openclean.function.eval.base.EvalFunction

Helper method to ensure that the function that is passed to an update operator is an evaluation function that was properly initialized.

If the function argument is a dictionary, it is converted into a lookup table. If the argument is a scalar value, it is converted into a constant evaluation function. Special attention is given to conditional replacement functions that do not have their pass-through function set.
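The conversion rules above can be illustrated with a minimal, library-independent sketch. The helper name `normalize_update_func` is hypothetical and is not part of openclean. Leaving unmapped values unchanged in the lookup case is an assumption of this sketch, and the special handling of conditional replacement functions is not modeled.

```python
from typing import Any, Callable


def normalize_update_func(func: Any) -> Callable[[Any], Any]:
    """Hypothetical helper: wrap a dict, scalar, or callable as a
    function that maps one cell value to its updated value."""
    if callable(func):
        # Callables are used as-is.
        return func
    if isinstance(func, dict):
        # A dictionary acts as a lookup table. Assumption of this sketch:
        # values that have no entry in the table are left unchanged.
        return lambda value: func.get(value, value)
    # Any other (scalar) argument becomes a constant function.
    return lambda value: func


values = ["NY", "N.Y.", "CA"]
lookup = normalize_update_func({"N.Y.": "NY"})
const = normalize_update_func(0)
lower = normalize_update_func(str.lower)

lookup_result = [lookup(v) for v in values]
const_result = [const(v) for v in values]
lower_result = [lower(v) for v in values]
```

In this sketch the callable check comes first so that functions and methods pass through untouched; dictionaries and scalars are not callable and fall through to the later branches.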

Parameters
  • func (scalar, dict, callable, openclean.function.value.base.ValueFunction, or openclean.function.eval.base.EvalFunction) – Specification of the (resulting) evaluation function that is used to generate the updated values for each row in the data frame.

  • columns (list(int or string)) – List of column index positions or column names.

Return type

openclean.function.eval.base.EvalFunction

openclean.operator.transform.update.swap(df: pandas.core.frame.DataFrame, col1: Union[int, str], col2: Union[int, str]) pandas.core.frame.DataFrame

Swap values in two columns of a data frame. Replaces values in column one with values in column two and vice versa for each row in a data frame.

Raises a ValueError if the column arguments are not of type int or string.
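The swap semantics can be sketched without pandas by modeling a data frame as a list of column names plus a list of rows. The function `swap_rows` and its signature are hypothetical stand-ins for illustration only; they do not reproduce how openclean implements `swap`.

```python
from typing import Any, List, Union


def swap_rows(
    rows: List[List[Any]],
    columns: List[str],
    col1: Union[int, str],
    col2: Union[int, str],
) -> List[List[Any]]:
    """Hypothetical sketch: return new rows with the values of two
    columns exchanged. The input rows are not modified."""

    def index_of(col: Union[int, str]) -> int:
        # Columns may be given by index position or by name.
        if isinstance(col, int):
            return col
        if isinstance(col, str):
            return columns.index(col)
        raise ValueError("invalid column {!r}".format(col))

    i, j = index_of(col1), index_of(col2)
    result = []
    for row in rows:
        new_row = list(row)
        # Exchange the two cell values within the row.
        new_row[i], new_row[j] = row[j], row[i]
        result.append(new_row)
    return result


columns = ["first", "last", "age"]
rows = [["Smith", "Alice", 30], ["Jones", "Bob", 25]]
swapped = swap_rows(rows, columns, "first", 1)
```

Mixing a column name and an index position in one call, as above, is permitted in the sketch because each argument is resolved independently.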

Parameters
  • df (pd.DataFrame) – Input data frame.

  • col1 (int or string) – Single column index or name.

  • col2 (int or string) – Single column index or name.

Return type

pd.DataFrame

Raises

ValueError

openclean.operator.transform.update.update(df: pandas.core.frame.DataFrame, columns: Union[int, str, List[Union[str, int]]], func: Union[Callable, Dict, openclean.function.eval.base.EvalFunction, int, float, str, datetime.datetime, openclean.function.value.base.ValueFunction]) pandas.core.frame.DataFrame

Update function for data frames. Returns a modified data frame where values in the specified columns have been modified using the given update function.

The update function is executed for each data frame row. The number of values returned by the function must match the number of columns that are being modified. Returned values are used to update column values in the same order as columns are specified in the columns list.

The function that is used to generate the update values is an evaluation function. Instead of an evaluation function, the user can also provide a constant value, a lookup dictionary, or a callable (or value function) that accepts a single value.
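The row-wise behavior described above, including the requirement that the number of returned values matches the number of updated columns, can be sketched in plain Python. The name `update_rows` is hypothetical. Passing the full row to the function and raising a `ValueError` on a length mismatch are assumptions of this sketch, not documented openclean behavior.

```python
from typing import Any, Callable, List, Sequence, Union


def update_rows(
    rows: List[List[Any]],
    columns: List[str],
    update_columns: List[Union[int, str]],
    func: Callable[[List[Any]], Sequence[Any]],
) -> List[List[Any]]:
    """Hypothetical sketch: evaluate func once per row and write the
    returned values into the listed columns, in list order."""
    indices = [
        columns.index(c) if isinstance(c, str) else c
        for c in update_columns
    ]
    result = []
    for row in rows:
        # Assumption: the function receives the whole row as a list.
        values = func(row)
        if len(values) != len(indices):
            # Assumption: a mismatch is reported as a ValueError.
            raise ValueError("expected {} values".format(len(indices)))
        new_row = list(row)
        for idx, value in zip(indices, values):
            new_row[idx] = value
        result.append(new_row)
    return result


columns = ["first", "last", "age"]
rows = [["alice", "smith", 30], ["bob", "jones", 25]]
updated = update_rows(
    rows,
    columns,
    ["first", "last"],
    lambda row: (row[0].title(), row[1].upper()),
)
```

Columns that are not listed, such as `age` here, are copied through unchanged.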

Parameters
  • df (pd.DataFrame) – Input data frame.

  • columns (int, string, or list(int or string)) – Single column or list of column index positions or column names.

  • func (scalar, dict, callable, openclean.function.value.base.ValueFunction, or openclean.function.eval.base.EvalFunction) – Specification of the (resulting) evaluation function that is used to generate the updated values for each row in the data frame.

Return type

pd.DataFrame