openclean.operator.transform.insert module

Data frame transformation operator that inserts new columns and rows into a data frame.

class openclean.operator.transform.insert.InsCol(names: Union[str, List[str]], pos: Optional[int] = None, values: Optional[Union[Callable, openclean.function.eval.base.EvalFunction, List, int, float, str, datetime.datetime, Tuple]] = None)

Bases: openclean.operator.stream.processor.StreamProcessor, openclean.operator.base.DataFrameTransformer

Data frame transformer that inserts columns into a data frame. Values for the new column(s) are generated using a given value generator function.

inspos(schema: List[Union[str, histore.document.schema.Column]]) → int

Get the insert position for the new column.

Raises a ValueError if the position is invalid.

Parameters

schema (list of string) – Dataset input schema.

Return type

int
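
The position check can be sketched in plain Python. This is a minimal illustration, not the openclean implementation: the helper name `resolve_insert_position` is hypothetical, and the sketch assumes that a position of None means "append" and that valid positions run from 0 to the schema length inclusive, as described for inscol in this module.

```python
from typing import List, Optional


def resolve_insert_position(schema: List[str], pos: Optional[int] = None) -> int:
    """Return the index at which new columns would be inserted.

    Hypothetical helper: assumes None means "append" and that valid
    positions run from 0 to len(schema) inclusive.
    """
    if pos is None:
        # No position given: insert after the last existing column.
        return len(schema)
    if pos < 0 or pos > len(schema):
        raise ValueError(f"invalid insert position {pos}")
    return pos


schema = ["name", "age"]
appended = resolve_insert_position(schema)   # append after "age"
front = resolve_insert_position(schema, 0)   # insert before "name"
```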

open(schema: List[Union[str, histore.document.schema.Column]]) → openclean.operator.stream.consumer.StreamFunctionHandler

Factory pattern for stream consumer. Returns an instance of a stream consumer that inserts values for the new column(s) into each data stream row.

Parameters

schema (list of string) – List of column names in the data stream schema.

Return type

openclean.operator.stream.consumer.StreamFunctionHandler
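
The factory idea (resolve the schema once, then process rows one at a time) can be modelled with a closure. The sketch below does not use openclean's StreamFunctionHandler; `make_row_inserter` is a hypothetical stand-in that assumes rows are plain lists and that a single constant fills every new column.

```python
from typing import Any, Callable, List, Optional


def make_row_inserter(
    schema: List[str], names: List[str], pos: Optional[int], value: Any
) -> Callable[[List[Any]], List[Any]]:
    """Build a function that inserts one constant per new column into a row."""
    # The insert position is validated once, when the stream is opened.
    index = len(schema) if pos is None else pos
    if index < 0 or index > len(schema):
        raise ValueError(f"invalid insert position {pos}")
    new_values = [value] * len(names)

    def handle(row: List[Any]) -> List[Any]:
        # Build a new list so the incoming row is not mutated.
        return row[:index] + new_values + row[index:]

    return handle


handler = make_row_inserter(["name", "age"], ["city"], pos=1, value="NYC")
stream = [["Alice", 23], ["Bob", 32]]
result = [handler(row) for row in stream]
```

Doing the position check in the factory rather than in the per-row function means an invalid position fails before any row is consumed.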

transform(df)

Modify rows in the given data frame. Returns a modified data frame where columns have been inserted containing results of evaluating the associated value generator function.

Parameters

df (pandas.DataFrame) – Input data frame.

Return type

pandas.DataFrame

Raises

ValueError

class openclean.operator.transform.insert.InsRow(pos=None, values=None)

Bases: openclean.operator.base.DataFrameTransformer

Data frame transformer that inserts rows into a data frame. If values is None, a single row with all None values will be inserted. If values is a list of lists, multiple rows will be inserted.

transform(df)

Insert rows into the given data frame. Returns a modified data frame where rows have been added. Raises a ValueError if the specified insert position is invalid or if the number of inserted values does not match the schema of the given data frame.

Parameters

df (pandas.DataFrame) – Input data frame.

Return type

pandas.DataFrame

Raises

ValueError

openclean.operator.transform.insert.inscol(df: pandas.core.frame.DataFrame, names: Union[str, List[str]], pos: Optional[int] = None, values: Optional[Union[int, float, str, datetime.datetime, openclean.function.eval.base.EvalFunction]] = None) → pandas.core.frame.DataFrame

Insert function for data frame columns. Returns a modified data frame where columns have been inserted at a given position. Exactly one column is inserted for each given column name. If the insert position is undefined, columns are appended to the data frame. If the position is invalid (i.e., not between 0 and len(df.columns)), a ValueError is raised.

Values for the inserted columns are generated using a given constant value or evaluation function. If a function is given, it is expected to return exactly one value for each of the inserted columns (e.g., a tuple of length len(names)).

Parameters
  • df (pd.DataFrame) – Input data frame.

  • names (string, or list(string)) – Names of the inserted columns.

  • pos (int, default=None) – Insert position for the new columns. If None, the columns will be appended.

  • values (scalar, tuple, or openclean.function.eval.base.EvalFunction, default=None) – Single value, tuple of values, or evaluation function that is used to generate the values for the inserted column(s). If no value is specified, all columns will contain None.

Return type

pd.DataFrame
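
The column-insert semantics described above can be modelled without pandas or openclean. The following is a sketch under stated assumptions, not the library's code: `insert_columns` is a hypothetical function, the table is a list of column names plus a list of row lists, a plain callable that receives each row as a dict stands in for an evaluation function, and a scalar is assumed to be repeated for every inserted column.

```python
from typing import Any, List, Optional, Tuple, Union

Row = List[Any]


def insert_columns(
    columns: List[str],
    rows: List[Row],
    names: Union[str, List[str]],
    pos: Optional[int] = None,
    values: Any = None,
) -> Tuple[List[str], List[Row]]:
    """Return new (columns, rows) with one column inserted per name."""
    names = [names] if isinstance(names, str) else list(names)
    index = len(columns) if pos is None else pos
    if index < 0 or index > len(columns):
        raise ValueError(f"invalid insert position {pos}")

    def new_values(row: Row) -> List[Any]:
        # A callable is evaluated once per row; anything else is a constant.
        val = values(dict(zip(columns, row))) if callable(values) else values
        if isinstance(val, tuple):
            if len(val) != len(names):
                raise ValueError("expected one value per inserted column")
            return list(val)
        # Assumption: a scalar (or None) fills every inserted column.
        return [val] * len(names)

    out_columns = columns[:index] + names + columns[index:]
    out_rows = [row[:index] + new_values(row) + row[index:] for row in rows]
    return out_columns, out_rows


columns = ["first", "last"]
rows = [["Ada", "Lovelace"], ["Alan", "Turing"]]

# Constant value, appended as a single column (pos is None).
cols_a, rows_a = insert_columns(columns, rows, "checked", values=False)

# Function returning a tuple with one entry per new column, inserted at 0.
cols_b, rows_b = insert_columns(
    columns,
    rows,
    ["initials", "length"],
    pos=0,
    values=lambda r: (r["first"][0] + r["last"][0], len(r["last"])),
)
```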

openclean.operator.transform.insert.insrow(df, pos=None, values=None)

Insert one or more rows into a data frame at a specified position. If a list of row values is given, it must contain exactly one value per column in the data frame.

Parameters
  • df (pandas.DataFrame) – Input data frame.

  • pos (int, optional) – Insert position for the new row(s). If None, the rows will be appended.

  • values (list, optional) – List of values (to insert one row) or list of lists (to insert multiple rows).

Return type

pandas.DataFrame
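
The three cases for values (None, a flat list, a list of lists) and the two error conditions can be illustrated with a small stand-alone model. This is a sketch only: `insert_rows` is a hypothetical function operating on lists of row lists, not the openclean implementation, and it assumes that valid positions run from 0 to the number of rows inclusive.

```python
from typing import Any, List, Optional

Row = List[Any]


def insert_rows(
    rows: List[Row],
    ncols: int,
    pos: Optional[int] = None,
    values: Optional[list] = None,
) -> List[Row]:
    """Return a new list of rows with the given row(s) inserted."""
    if values is None:
        # No values: insert a single row of None, one entry per column.
        new_rows = [[None] * ncols]
    elif values and all(isinstance(v, list) for v in values):
        # List of lists: insert multiple rows.
        new_rows = [list(v) for v in values]
    else:
        # Flat list: insert one row.
        new_rows = [list(values)]
    for row in new_rows:
        if len(row) != ncols:
            raise ValueError("expected exactly one value per column")
    index = len(rows) if pos is None else pos
    if index < 0 or index > len(rows):
        raise ValueError(f"invalid insert position {pos}")
    return rows[:index] + new_rows + rows[index:]


table = [["Alice", 23], ["Bob", 32]]
one = insert_rows(table, 2, pos=1, values=["Carol", 41])
many = insert_rows(table, 2, values=[["Dan", 19], ["Eve", 27]])
blank = insert_rows(table, 2, pos=0)
```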