openclean.operator.base module

Abstract classes for openclean pipeline operators. There are four primary types of operators:

  • DataFrameTransformer: The data frame transformer takes a DataFrame as input and generates a DataFrame as output.

  • DataFrameMapper: The data frame mapper (group generator) takes a DataFrame as input and outputs a GroupBy.

  • DataGroupReducer: The group reducer takes a GroupBy as input and outputs a DataFrame.

  • DataGroupTransformer: The group transformer takes a GroupBy as input and outputs a GroupBy.

In addition to the output DataFrame or GroupBy object, each operator can output a stage state object.
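The four operator types compose by matching the kind of object one stage outputs to the kind the next stage expects. The following is a minimal, dependency-free sketch of that flow. It is not openclean code: plain functions stand in for operator classes, a list of dict rows stands in for a DataFrame, a dict of row lists stands in for a GroupBy, and all function names are hypothetical.

```python
from collections import defaultdict

# Stand-ins (assumptions for illustration only):
#   frame    = list of dict rows   (in place of a pandas DataFrame)
#   grouping = dict: key -> frame  (in place of a GroupBy object)


def transform_frame(frame):
    """DataFrameTransformer role: frame in, frame out (drop rows without a city)."""
    return [row for row in frame if row["city"] is not None]


def map_frame(frame):
    """DataFrameMapper role: frame in, grouping out (group rows by city)."""
    groups = defaultdict(list)
    for row in frame:
        groups[row["city"]].append(row)
    return dict(groups)


def transform_groups(groups):
    """DataGroupTransformer role: grouping in, grouping out (keep groups with more than one row)."""
    return {key: rows for key, rows in groups.items() if len(rows) > 1}


def reduce_groups(groups):
    """DataGroupReducer role: grouping in, frame out (one summary row per group)."""
    return [{"city": key, "count": len(rows)} for key, rows in groups.items()]


frame = [
    {"name": "A", "city": "NYC"},
    {"name": "B", "city": None},
    {"name": "C", "city": "NYC"},
    {"name": "D", "city": "LA"},
]

# Chain the four roles: frame -> frame -> grouping -> grouping -> frame.
result = reduce_groups(transform_groups(map_frame(transform_frame(frame))))
print(result)
```

Only the input and output kinds of each step mirror the operator types described above; the filtering and counting logic is arbitrary.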

class openclean.operator.base.DataFrameMapper

Bases: openclean.operator.base.PipelineStage

Abstract class for pipeline components that take a data frame as input and return a data frame grouping as output.

abstract map(df)

This is the main method that each subclass of the mapper has to implement. The input is a pandas data frame. The output is a group of pandas data frames.

Parameters

df (pd.DataFrame) – Input data frame.

Return type

openclean.data.groupby.DataFrameGrouping
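A rough sketch of the subclassing contract for a mapper follows. It assumes the abstract method is enforced through Python's abc module, and it uses hypothetical stand-ins throughout: MapperSketch replaces DataFrameMapper, a list of dict rows replaces the pandas data frame, and a plain dict replaces DataFrameGrouping.

```python
from abc import ABC, abstractmethod


class MapperSketch(ABC):
    """Hypothetical stand-in for DataFrameMapper; not the openclean class."""

    @abstractmethod
    def map(self, df):
        """Return a grouping for the given data frame."""


class GroupByKey(MapperSketch):
    """Hypothetical mapper that groups rows by the value of one column."""

    def __init__(self, column):
        self.column = column

    def map(self, df):
        groups = {}
        for row in df:
            groups.setdefault(row[self.column], []).append(row)
        return groups


# The abstract base cannot be instantiated until map() is implemented.
try:
    MapperSketch()
    instantiated = True
except TypeError:
    instantiated = False

rows = [
    {"id": 1, "zip": "10001"},
    {"id": 2, "zip": "10002"},
    {"id": 3, "zip": "10001"},
]
groups = GroupByKey("zip").map(rows)
```

A real implementation would subclass openclean.operator.base.DataFrameMapper and return a DataFrameGrouping instead of a dict.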

class openclean.operator.base.DataFrameSplitter

Bases: openclean.operator.base.PipelineStage

Abstract class for pipeline components that take a data frame as input and return two data frames as output. This is a special case of the data frame mapper.

abstract split(df)

This is the main method that each subclass of the splitter has to implement. The input is a pandas data frame. The output is a pair of data frames.

Parameters

df (pd.DataFrame) – Input data frame.

Return type

pd.DataFrame, pd.DataFrame
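As an illustration of the two-output contract, the sketch below splits rows by a predicate and returns a pair. The class SplitByPredicate and its predicate parameter are hypothetical, and a list of dict rows again stands in for a pandas data frame; a real splitter would subclass DataFrameSplitter and return two pd.DataFrame objects.

```python
class SplitByPredicate:
    """Hypothetical splitter: rows that satisfy the predicate go first, the rest second."""

    def __init__(self, predicate):
        self.predicate = predicate

    def split(self, df):
        matched, rest = [], []
        for row in df:
            # Route each row to exactly one of the two outputs.
            (matched if self.predicate(row) else rest).append(row)
        return matched, rest


rows = [{"age": 17}, {"age": 42}, {"age": 30}]
adults, minors = SplitByPredicate(lambda row: row["age"] >= 18).split(rows)
```

Every input row lands in exactly one of the two outputs, which is what makes the splitter read as a mapper restricted to two groups.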

class openclean.operator.base.DataFrameTransformer

Bases: openclean.operator.base.PipelineStage

Abstract class for pipeline components that take a data frame as input and return a transformed data frame as output.

abstract transform(df)

This is the main method that each subclass of the transformer has to implement. The input is a pandas data frame. The output is a modified data frame.

Parameters

df (pd.DataFrame) – Input data frame.

Return type

pd.DataFrame

class openclean.operator.base.DataGroupReducer

Bases: openclean.operator.base.PipelineStage

Abstract class for pipeline components that take a group of data frames as input and return a single data frame as output.

abstract reduce(groups: openclean.data.groupby.DataFrameGrouping) → pandas.core.frame.DataFrame

This is the main method that each subclass of the group reducer has to implement. The input is a pandas data frame grouping. The output is a single data frame.

Parameters

groups (openclean.data.groupby.DataFrameGrouping) – Grouping of pandas data frames.

Return type

pd.DataFrame
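One possible shape for a reducer is sketched below: it collapses a grouping into a single frame by keeping the first row of each group. The class FirstRowPerGroup is hypothetical, a dict of row lists stands in for DataFrameGrouping, and a list of dict rows stands in for the resulting pd.DataFrame.

```python
class FirstRowPerGroup:
    """Hypothetical reducer: keep one representative row per group."""

    def reduce(self, groups):
        # Skip empty groups so that indexing the first row is always safe.
        return [rows[0] for rows in groups.values() if rows]


groups = {
    "10001": [{"id": 1, "zip": "10001"}, {"id": 3, "zip": "10001"}],
    "10002": [{"id": 2, "zip": "10002"}],
}
frame = FirstRowPerGroup().reduce(groups)
```

A real reducer would subclass DataGroupReducer and build its output with pandas rather than with lists.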

class openclean.operator.base.DataGroupTransformer

Bases: openclean.operator.base.PipelineStage

Abstract class for pipeline components that take a data frame grouping as input and return a transformed grouping.

abstract transform(groups)

This is the main method that each subclass of the transformer has to implement. The input is a pandas data frame grouping. The output is a modified data frame grouping.

Parameters

groups (openclean.data.groupby.DataFrameGrouping) – Grouping of pandas data frames.

Return type

openclean.data.groupby.DataFrameGrouping

class openclean.operator.base.PipelineStage

Bases: object

Generic pipeline stage interface.

is_frame_mapper() → bool

Test whether a pipeline operator is a (sub-)class of the data frame mapper type.

Return type

bool

is_frame_splitter() → bool

Test whether a pipeline operator is a (sub-)class of the data frame splitter type.

Return type

bool

is_frame_transformer() → bool

Test whether a pipeline operator is a (sub-)class of the data frame transformer type.

Return type

bool

is_group_reducer() → bool

Test whether a pipeline operator is a (sub-)class of the data group reducer type.

Return type

bool

is_group_transformer() → bool

Test whether a pipeline operator is a (sub-)class of the data group transformer type.

Return type

bool
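The type tests above let calling code dispatch on the kind of a stage without importing every operator class. How openclean implements them is not stated here; the sketch below assumes simple isinstance checks and uses hypothetical stand-in classes (Stage, FrameTransformer, GroupReducer) and a hypothetical helper (describe), covering only two of the five tests.

```python
class Stage:
    """Hypothetical stand-in for PipelineStage; covers two of the five type tests."""

    def is_frame_transformer(self):
        # Names are resolved at call time, so the subclasses can be defined below.
        return isinstance(self, FrameTransformer)

    def is_group_reducer(self):
        return isinstance(self, GroupReducer)


class FrameTransformer(Stage):
    """Stand-in for the data frame transformer type."""


class GroupReducer(Stage):
    """Stand-in for the data group reducer type."""


def describe(stage):
    """Hypothetical helper that dispatches on the stage type tests."""
    if stage.is_frame_transformer():
        return "frame -> frame"
    if stage.is_group_reducer():
        return "grouping -> frame"
    return "unknown"


labels = [describe(FrameTransformer()), describe(GroupReducer()), describe(Stage())]
```

Because the checks rely on inheritance, a subclass of a stand-in type reports the same result as its parent, which matches the "(sub-)class" wording in the method descriptions.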