openclean.operator.transform.update module
Data frame transformation operator that updates values in columns of a data frame.
- class openclean.operator.transform.update.Update(columns: Union[int, str, List[Union[str, int]]], func: Union[Callable, Dict, openclean.function.eval.base.EvalFunction, int, float, str, datetime.datetime, openclean.function.value.base.ValueFunction])
Bases: openclean.operator.stream.processor.StreamProcessor, openclean.operator.base.DataFrameTransformer
Data frame transformer that updates values in data frame column(s) using a given update function. The function is executed for each row and the resulting values replace the original cell values in the row for all listed columns (in their order of appearance in the columns list).
- open(schema: List[Union[str, histore.document.schema.Column]]) openclean.operator.stream.consumer.StreamFunctionHandler
Factory pattern for stream consumer. Returns an instance of a stream consumer that updates values in a data stream row.
- Parameters
schema (list of string) – List of column names in the data stream schema.
- Return type
openclean.operator.stream.consumer.StreamFunctionHandler
- transform(df: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame
Modify rows in the given data frame. Returns a modified data frame where values have been updated by the results of evaluating the associated row update function.
- Parameters
df (pandas.DataFrame) – Input data frame.
- Return type
pandas.DataFrame
- openclean.operator.transform.update.get_update_function(func: Union[Callable, Dict, openclean.function.eval.base.EvalFunction, int, float, str, datetime.datetime, openclean.function.value.base.ValueFunction], columns: Union[int, str, List[Union[str, int]]]) openclean.function.eval.base.EvalFunction
Helper method to ensure that the function that is passed to an update operator is an evaluation function that was properly initialized.
If the function argument is a dictionary, it is converted into a lookup table. If the argument is a scalar value, it is converted into a constant evaluation function. Special attention is given to conditional replacement functions that do not have their pass-through function set.
- Parameters
func (scalar, dict, callable, openclean.function.value.base.ValueFunction, or openclean.function.eval.base.EvalFunction) – Specification of the (resulting) evaluation function that is used to generate the updated values for each row in the data frame.
columns (list(int or string)) – List of column index positions or column names.
- Return type
openclean.function.eval.base.EvalFunction
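The conversion rules described for get_update_function can be pictured with a small plain-Python model. This is only a sketch of the described behavior, not openclean's implementation: `to_row_function` is a hypothetical name, rows are modeled as plain lists, and the pass-through of values that are missing from the dictionary is an assumption.

```python
from typing import Any, Callable, List

Row = List[Any]


def to_row_function(func: Any, column: int) -> Callable[[Row], Any]:
    """Hypothetical stand-in: turn a dict, callable, or scalar into a per-row function."""
    if isinstance(func, dict):
        # Dictionary: treat it as a lookup table on the column value.
        # Assumption: values without a mapping are passed through unchanged.
        return lambda row: func.get(row[column], row[column])
    if callable(func):
        # Callable that accepts a single value: apply it to the column value.
        return lambda row: func(row[column])
    # Anything else is treated as a scalar: return it for every row.
    return lambda row: func


rows = [["a", 1], ["b", 2]]
lookup = to_row_function({"a": "A"}, column=0)
upper = to_row_function(str.upper, column=0)
const = to_row_function(0, column=0)
print([lookup(r) for r in rows])
print([upper(r) for r in rows])
print([const(r) for r in rows])
```

The dictionary check comes first so that the three input kinds are handled in a fixed, unambiguous order.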
- openclean.operator.transform.update.swap(df: pandas.core.frame.DataFrame, col1: Union[int, str], col2: Union[int, str]) pandas.core.frame.DataFrame
Swap values in two columns of a data frame. Replaces values in column one with values in column two and vice versa for each row in a data frame.
Raises a ValueError if the column arguments are not of type int or string.
- Parameters
df (pd.DataFrame) – Input data frame.
col1 (int or string) – Single column index or name.
col2 (int or string) – Single column index or name.
- Return type
pd.DataFrame
- Raises
ValueError – If the column arguments are not of type int or string.
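The row-wise swap can be modeled without pandas as follows. This is a sketch under stated assumptions rather than the library code: `swap_values` and `resolve_column` are hypothetical names, and the table is modeled as a list of column names plus a list of row lists.

```python
from typing import Any, List, Union


def resolve_column(columns: List[str], col: Union[int, str]) -> int:
    """Hypothetical helper: map a column index or name to an index position."""
    if isinstance(col, int):
        return col
    if isinstance(col, str):
        return columns.index(col)
    # Mirrors the documented behavior: only int and string arguments are valid.
    raise ValueError(f"invalid column {col!r}")


def swap_values(
    columns: List[str],
    rows: List[List[Any]],
    col1: Union[int, str],
    col2: Union[int, str],
) -> List[List[Any]]:
    """Return new rows with the values of the two columns exchanged."""
    i = resolve_column(columns, col1)
    j = resolve_column(columns, col2)
    result = []
    for row in rows:
        new_row = list(row)  # copy, so the input rows stay unchanged
        new_row[i], new_row[j] = row[j], row[i]
        result.append(new_row)
    return result


columns = ["A", "B", "C"]
rows = [[1, 2, 3], [4, 5, 6]]
print(swap_values(columns, rows, "A", 2))
```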
- openclean.operator.transform.update.update(df: pandas.core.frame.DataFrame, columns: Union[int, str, List[Union[str, int]]], func: Union[Callable, Dict, openclean.function.eval.base.EvalFunction, int, float, str, datetime.datetime, openclean.function.value.base.ValueFunction]) pandas.core.frame.DataFrame
Update function for data frames. Returns a modified data frame where values in the specified columns have been modified using the given update function.
The update function is executed for each data frame row. The number of values returned by the function must match the number of columns that are being modified. Returned values are used to update column values in the same order as columns are specified in the columns list.
The function that is used to generate the update values will be an evaluation function. The user has the option to also provide a constant value, a lookup dictionary, or a callable (or value function) that accepts a single value.
- Parameters
df (pd.DataFrame) – Input data frame.
columns (int, string, or list(int or string), optional) – Single column or list of column index positions or column names.
func (scalar, dict, callable, openclean.function.value.base.ValueFunction, or openclean.function.eval.base.EvalFunction) – Specification of the (resulting) evaluation function that is used to generate the updated values for each row in the data frame.
- Return type
pd.DataFrame
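The row-wise contract described above (one returned value per updated column, assigned in the order of the columns list) can be sketched in plain Python. This models the documented semantics only: `update_rows` is a hypothetical name, rows are plain lists addressed by index position, and raising a ValueError on a length mismatch is an assumption.

```python
from typing import Any, Callable, List, Union


def update_rows(
    rows: List[List[Any]],
    columns: Union[int, List[int]],
    func: Callable[[List[Any]], Any],
) -> List[List[Any]]:
    """Hypothetical model: apply func to each row and write results into columns."""
    cols = [columns] if isinstance(columns, int) else list(columns)
    result = []
    for row in rows:
        values = func(row)
        if len(cols) == 1:
            # A single updated column takes the function result as one value.
            values = [values]
        if len(values) != len(cols):
            # Assumption: a mismatch between values and columns is an error.
            raise ValueError("number of values does not match number of columns")
        new_row = list(row)  # copy, so the input rows stay unchanged
        for col, value in zip(cols, values):
            new_row[col] = value
        result.append(new_row)
    return result


rows = [["alice", "SMITH"], ["bob", "JONES"]]
# Update both columns at once: the function returns one value per column.
print(update_rows(rows, [0, 1], lambda r: (r[0].capitalize(), r[1].capitalize())))
# Update a single column: the function returns a single value.
print(update_rows(rows, 1, lambda r: r[1].lower()))
```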