openclean.operator.transform.select module

Functions and classes that implement the column selection operator in openclean.

class openclean.operator.transform.select.Select(columns: Union[int, str, List[Union[str, int]]], names: Optional[Union[str, List[str]]] = None)

Bases: openclean.operator.stream.processor.StreamProcessor, openclean.operator.base.DataFrameTransformer

Data frame transformer that selects a list of columns from a data frame. The output is a data frame that contains all rows from an input data frame but only those columns that are included in a given select clause.

open(schema: List[Union[str, histore.document.schema.Column]]) openclean.operator.stream.consumer.StreamFunctionHandler

Factory method for the stream consumer. Returns an instance of a stream consumer that filters the columns of each data frame row using the associated list of columns (i.e., the select clause).

Parameters

schema (list of string or histore.document.schema.Column) – List of columns in the data stream schema.

Return type

openclean.operator.stream.consumer.StreamFunctionHandler
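
As a rough illustration of the row filtering described for open(), the pure-Python sketch below resolves a select clause against a schema once and returns a function that projects each incoming row. The helper make_row_filter, its signature, and its error handling are hypothetical and are not part of openclean; the real method returns a StreamFunctionHandler.

```python
from typing import Callable, List, Sequence, Union


def make_row_filter(
    schema: Sequence[str], columns: Sequence[Union[int, str]]
) -> Callable[[Sequence], List]:
    """Hypothetical helper: build a function that keeps only selected columns."""
    schema = list(schema)
    positions = []
    # Resolve every entry of the select clause to an index position up front,
    # so that per-row work is a simple positional lookup.
    for col in columns:
        if isinstance(col, int):
            if not 0 <= col < len(schema):
                raise ValueError(f"invalid column index {col}")
            positions.append(col)
        elif col in schema:
            positions.append(schema.index(col))
        else:
            # Assumption: unknown names are rejected with a ValueError.
            raise ValueError(f"unknown column name {col!r}")

    def filter_row(row: Sequence) -> List:
        # Output order follows the select clause, not the schema.
        return [row[i] for i in positions]

    return filter_row


keep = make_row_filter(["name", "age", "city"], ["city", 0])
filtered = [keep(row) for row in [["Alice", 30, "NYC"], ["Bob", 25, "LA"]]]
```

Resolving the positions once when the stream is opened, rather than per row, keeps the per-row cost proportional to the number of selected columns.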

transform(df: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame

Return a data frame that contains all rows but only those columns from the given input data frame that are included in the select clause.

Raises a ValueError if the list of columns contains an item that cannot be matched to a column in the given data frame.

Parameters

df (pandas.DataFrame) – Input data frame.

Return type

pandas.DataFrame

Raises

ValueError

openclean.operator.transform.select.select(df: pandas.core.frame.DataFrame, columns: Union[int, str, List[Union[str, int]]], names: Optional[Union[str, List[str]]] = None) pandas.core.frame.DataFrame

Projection operator that selects a list of columns from a data frame. Returns a data frame that contains only those columns that are included in the given select clause. The optional list of names can be used to rename the columns in the resulting data frame. If a list of names is given, it must have the same length as the list of columns.

Parameters
  • df (pandas.DataFrame) – Input data frame.

  • columns (int, string, or list(int or string)) – Single column or list of column index positions or column names.

  • names (string or list(string)) – Single name or list of names for the resulting columns.

Return type

pandas.DataFrame

Raises

ValueError
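
The projection semantics described above can be sketched with plain pandas. The function project_columns below is a hypothetical stand-in written for illustration only; it is not openclean's implementation, and the exact conditions under which the library raises ValueError are assumed from the description (unmatched column, or names and columns of different lengths).

```python
from typing import List, Optional, Union

import pandas as pd

ColumnRef = Union[int, str]


def project_columns(
    df: pd.DataFrame,
    columns: Union[ColumnRef, List[ColumnRef]],
    names: Optional[Union[str, List[str]]] = None,
) -> pd.DataFrame:
    """Hypothetical stand-in mimicking the documented behaviour of select()."""
    # Normalise scalar arguments to lists.
    cols = columns if isinstance(columns, list) else [columns]
    if names is not None and not isinstance(names, list):
        names = [names]
    if names is not None and len(names) != len(cols):
        raise ValueError("names and columns must have the same length")

    # Resolve each column reference (index position or name) to a position.
    positions = []
    for col in cols:
        if isinstance(col, int):
            if not 0 <= col < len(df.columns):
                raise ValueError(f"invalid column index {col}")
            positions.append(col)
        else:
            matches = [i for i, c in enumerate(df.columns) if c == col]
            if not matches:
                raise ValueError(f"unknown column name {col!r}")
            positions.append(matches[0])

    # Keep all rows, but only the selected columns, in select-clause order.
    result = df.iloc[:, positions].copy()
    if names is not None:
        result.columns = names
    return result


df = pd.DataFrame(
    {"name": ["Alice", "Bob"], "age": [30, 25], "city": ["NYC", "LA"]}
)
# Mix a column name and an index position, and rename both output columns.
projected = project_columns(df, columns=["city", 0], names=["location", "person"])
```

With the signature documented above, the equivalent library call would presumably be select(df, columns=["city", 0], names=["location", "person"]).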