openclean.operator.base module
Abstract classes for openclean pipeline operators. There are four primary types of operators:
DataFrameTransformer: The data frame transformer takes a DataFrame as input and generates a DataFrame as output.
DataFrameMapper: The group generator takes as input a DataFrame and outputs a GroupBy.
DataGroupReducer: The group reducer takes a GroupBy as input and outputs a DataFrame.
DataGroupTransformer: The group transformer takes a GroupBy as input and outputs a GroupBy.
In addition to the output DataFrame or GroupBy object, each operator can output a stage state object.
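As a rough illustration of how the four operator types compose, the sketch below chains a frame transformer, a mapper, a group transformer, and a reducer. It deliberately avoids openclean and pandas: a data frame is modeled as a list of row dictionaries and a grouping as a dictionary from group key to rows. All function names are hypothetical and only mirror the input and output types listed above.

```python
from collections import defaultdict

# Stand-ins (assumptions for illustration only): a "data frame" is a list of
# row dictionaries, a "grouping" maps a group key to a list of rows.


def transform_frame(df):
    """Frame -> frame: normalize the 'city' values."""
    return [{**row, "city": row["city"].strip().title()} for row in df]


def map_by_city(df):
    """Frame -> grouping: group rows by their 'city' value."""
    groups = defaultdict(list)
    for row in df:
        groups[row["city"]].append(row)
    return dict(groups)


def keep_large_groups(groups, min_size=2):
    """Grouping -> grouping: drop groups with fewer than min_size rows."""
    return {key: rows for key, rows in groups.items() if len(rows) >= min_size}


def reduce_to_counts(groups):
    """Grouping -> frame: emit one output row per group."""
    return [{"city": key, "count": len(rows)} for key, rows in groups.items()]


df = [{"city": " boston"}, {"city": "Boston "}, {"city": "nyc"}]
result = reduce_to_counts(keep_large_groups(map_by_city(transform_frame(df))))
print(result)
```

Each step only depends on the type of its input, which is what allows stages of the same kind to be swapped within a pipeline.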
- class openclean.operator.base.DataFrameMapper
Bases:
openclean.operator.base.PipelineStage
Abstract class for pipeline components that take a data frame as input and return a data frame grouping as output.
- abstract map(df)
This is the main method that each subclass of the mapper has to implement. The input is a pandas data frame. The output is a group of pandas data frames.
- Parameters
df (pd.DataFrame) – Input data frame.
- Return type
- class openclean.operator.base.DataFrameSplitter
Bases:
openclean.operator.base.PipelineStage
Abstract class for pipeline components that take a data frame as input and return two data frames as output. This is a special case of the data frame mapper.
- abstract split(df)
This is the main method that each subclass of the splitter has to implement. The input is a pandas data frame. The output is a pair of data frames.
- Parameters
df (pd.DataFrame) – Input data frame.
- Return type
pd.DataFrame, pd.DataFrame
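A minimal sketch of what a splitter subclass could look like. The base class here is a local stand-in, not the openclean implementation, the `SplitOnPredicate` class is hypothetical, and a list of row dictionaries stands in for a pandas data frame.

```python
from abc import ABCMeta, abstractmethod


class DataFrameSplitter(metaclass=ABCMeta):
    """Stand-in for the openclean base class (assumption, not the real code)."""

    @abstractmethod
    def split(self, df):
        """Return a pair of frames derived from the input frame."""
        raise NotImplementedError()


class SplitOnPredicate(DataFrameSplitter):
    """Hypothetical splitter: rows that satisfy the predicate go into the
    first frame, all remaining rows into the second frame.
    """

    def __init__(self, predicate):
        self.predicate = predicate

    def split(self, df):
        matched = [row for row in df if self.predicate(row)]
        rest = [row for row in df if not self.predicate(row)]
        return matched, rest


df = [{"age": 17}, {"age": 42}, {"age": 8}]
adults, minors = SplitOnPredicate(lambda row: row["age"] >= 18).split(df)
```

Every input row ends up in exactly one of the two outputs, so the split is a grouping with two fixed groups.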
- class openclean.operator.base.DataFrameTransformer
Bases:
openclean.operator.base.PipelineStage
Abstract class for pipeline components that take a data frame as input and return a transformed data frame as output.
- abstract transform(df)
This is the main method that each subclass of the transformer has to implement. The input is a pandas data frame. The output is a modified data frame.
- Parameters
df (pd.DataFrame) – Input data frame.
- Return type
pd.DataFrame
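The following sketch shows the shape of a transformer subclass under the same assumptions as before: the base class is a local stand-in for the openclean class, `UpperCaseColumn` is a hypothetical example, and a list of row dictionaries replaces the pandas data frame.

```python
from abc import ABCMeta, abstractmethod


class DataFrameTransformer(metaclass=ABCMeta):
    """Stand-in for the openclean base class (assumption, not the real code)."""

    @abstractmethod
    def transform(self, df):
        """Return a modified version of the input frame."""
        raise NotImplementedError()


class UpperCaseColumn(DataFrameTransformer):
    """Hypothetical transformer that upper-cases the values of one column."""

    def __init__(self, column):
        self.column = column

    def transform(self, df):
        # Build new rows so that the input frame is left unchanged.
        return [{**row, self.column: str(row[self.column]).upper()} for row in df]


df = [{"name": "alice", "id": 1}, {"name": "bob", "id": 2}]
result = UpperCaseColumn("name").transform(df)
```

Returning a new frame rather than modifying the input in place is a design choice of this sketch; the source does not state which behavior openclean transformers follow.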
- class openclean.operator.base.DataGroupReducer
Bases:
openclean.operator.base.PipelineStage
Abstract class for pipeline components that take a group of data frames as input and return a single data frame as output.
- abstract reduce(groups: openclean.data.groupby.DataFrameGrouping) → pandas.core.frame.DataFrame
This is the main method that each subclass of the group reducer has to implement. The input is a pandas data frame grouping. The output is a single data frame.
- Parameters
groups (openclean.data.groupby.DataFrameGrouping) – Grouping of pandas data frames.
- Return type
pd.DataFrame
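A reducer collapses a grouping back into a single frame. In the sketch below the grouping is modeled as a plain dictionary from group key to rows, which is an assumption made for illustration and not the interface of `DataFrameGrouping`. The base class is a local stand-in and `LargestValuePerGroup` is hypothetical.

```python
from abc import ABCMeta, abstractmethod


class DataGroupReducer(metaclass=ABCMeta):
    """Stand-in for the openclean base class (assumption, not the real code)."""

    @abstractmethod
    def reduce(self, groups):
        """Return a single frame computed from the given grouping."""
        raise NotImplementedError()


class LargestValuePerGroup(DataGroupReducer):
    """Hypothetical reducer that keeps, for every group, the row with the
    largest value in a given column.
    """

    def __init__(self, column):
        self.column = column

    def reduce(self, groups):
        return [
            max(rows, key=lambda row: row[self.column])
            for rows in groups.values()
        ]


groups = {
    "a": [{"key": "a", "score": 1}, {"key": "a", "score": 5}],
    "b": [{"key": "b", "score": 3}],
}
result = LargestValuePerGroup("score").reduce(groups)
```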
- class openclean.operator.base.DataGroupTransformer
Bases:
openclean.operator.base.PipelineStage
Abstract class for pipeline components that take a data frame grouping as input and return a transformed grouping.
- abstract transform(groups)
This is the main method that each subclass of the transformer has to implement. The input is a pandas data frame grouping. The output is a modified data frame grouping.
- Parameters
groups (openclean.data.groupby.DataFrameGrouping) – Grouping of pandas data frames.
- Return type
- class openclean.operator.base.PipelineStage
Bases:
object
Generic pipeline stage interface.
- is_frame_mapper() → bool
Test whether a pipeline operator is a (sub-)class of the data frame mapper type.
- Return type
bool
- is_frame_splitter() → bool
Test whether a pipeline operator is a (sub-)class of the data frame splitter type.
- Return type
bool
- is_frame_transformer() → bool
Test whether a pipeline operator is a (sub-)class of the data frame transformer type.
- Return type
bool
- is_group_reducer() → bool
Test whether a pipeline operator is a (sub-)class of the data group reducer type.
- Return type
bool
- is_group_transformer() → bool
Test whether a pipeline operator is a (sub-)class of the data group transformer type.
- Return type
bool
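The type tests above let calling code decide how to invoke a stage without importing every operator class. The sketch below shows one way such tests could be implemented and used. It is an assumption: the source does not show how openclean implements these methods, only two of the five tests are reproduced, and the `Identity` class and `describe` function are hypothetical.

```python
class PipelineStage:
    """Stand-in for the base class; the real implementation may differ."""

    def is_frame_transformer(self) -> bool:
        # The class name is resolved when the method is called, so it may
        # be defined further down in the module.
        return isinstance(self, DataFrameTransformer)

    def is_group_reducer(self) -> bool:
        return isinstance(self, DataGroupReducer)


class DataFrameTransformer(PipelineStage):
    """Stand-in for the frame transformer type."""


class DataGroupReducer(PipelineStage):
    """Stand-in for the group reducer type."""


class Identity(DataFrameTransformer):
    """Hypothetical transformer that returns its input unchanged."""

    def transform(self, df):
        return df


def describe(stage):
    """Dispatch on the stage type, as a pipeline runner might do."""
    if stage.is_frame_transformer():
        return "frame -> frame"
    if stage.is_group_reducer():
        return "grouping -> frame"
    return "unknown"


stage = Identity()
label = describe(stage)
```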