openclean.operator.map.groupby module

Class that implements the DataFrameMapper abstract class to perform groupby operations on a pandas data frame.

class openclean.operator.map.groupby.GroupBy(columns: Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction, List[Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction]]], func: Optional[Union[Callable, openclean.function.value.base.ValueFunction]] = None)

Bases: openclean.operator.base.DataFrameMapper

GroupBy operator that takes the column(s) to group on and an optional function, performs the groupby, and returns a DataFrameGrouping object.

map(df: pandas.core.frame.DataFrame) openclean.data.groupby.DataFrameGrouping

Transforms a pandas DataFrame into a DataFrameGrouping object.

Parameters

df (pandas.DataFrame) – Dataframe to transform using groupby

Return type

openclean.data.groupby.DataFrameGrouping

static select(group: pandas.core.frame.DataFrame, condition: Union[Callable, int]) bool

Given a group (a data frame) and a condition, returns whether the group should be selected.

Parameters
  • group (pd.DataFrame) – the group/df under consideration

  • condition (int or callable) – If not provided, the group is selected. If an int, the group’s number of rows is checked against the condition. If a callable, the group is passed to it; the callable should return a boolean.

Return type

bool

Raises

TypeError
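The selection rule described above can be sketched in plain Python. This is an illustration of the documented semantics, not the library's implementation: `select_sketch` is a hypothetical name, a list of row dicts stands in for the pandas data frame, the int check is taken to be equality (as the `having` parameter of `groupby` describes it), and the two situations that raise `TypeError` are assumptions, since this reference does not say when the error is raised.

```python
from typing import Any, Callable, Dict, List, Optional, Union

Row = Dict[str, Any]
Group = List[Row]  # stand-in for a pandas DataFrame holding one group


def select_sketch(
    group: Group, condition: Optional[Union[Callable, int]] = None
) -> bool:
    """Decide whether a group passes the condition (illustrative only)."""
    if condition is None:
        # No condition given: every group is selected.
        return True
    if isinstance(condition, int):
        # Integer condition: compare against the group's number of rows.
        return len(group) == condition
    if callable(condition):
        result = condition(group)
        if not isinstance(result, bool):
            # Assumption: a non-boolean result is reported as a TypeError.
            raise TypeError("condition callable should return a boolean")
        return result
    # Assumption: unsupported condition types are reported as a TypeError.
    raise TypeError(f"unsupported condition type: {type(condition)}")


group = [{"a": 1}, {"a": 2}]
two_rows = select_sketch(group, 2)
more_than_one = select_sketch(group, lambda g: len(g) > 1)
```

Here both `two_rows` and `more_than_one` evaluate to `True`, while `select_sketch(group, 3)` returns `False`.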

openclean.operator.map.groupby.get_eval_func(columns: Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction, List[Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction]]], func: Optional[Union[Callable, openclean.function.value.base.ValueFunction]] = None) openclean.function.eval.base.EvalFunction

Helper function used to create an evaluation function from a key generator specification.

Parameters
  • columns (int, string, openclean.function.eval.base.EvalFunction, or list) – Single column or evaluation function, or a list of columns or evaluation functions. The column(s)/function(s) are used to generate group-by keys for each row in the input data frame.

  • func (openclean.function.value.base.ValueFunction or callable, default=None) – Optional callable or value function that is used to generate a group-by key from the values that are generated by the columns clause. This is a shortcut for creating an evaluation function with columns as input and func as the evaluated function.

Return type

openclean.function.eval.base.EvalFunction
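To make the relationship between columns and func concrete, here is a minimal plain-Python sketch of a key generator. It does not use openclean: `make_key_func` is a hypothetical helper, rows are plain dicts, and only column names are supported (no column indices or evaluation functions). How the library passes values from several columns to func is not stated in this reference; the sketch assumes a scalar for a single column and a tuple for a list of columns.

```python
from typing import Any, Callable, Dict, List, Optional, Union

Row = Dict[str, Any]


def make_key_func(
    columns: Union[str, List[str]], func: Optional[Callable] = None
) -> Callable[[Row], Any]:
    """Build a function that maps a row to its group-by key (hypothetical)."""
    is_list = isinstance(columns, list)
    names = list(columns) if is_list else [columns]

    def key(row: Row) -> Any:
        values = tuple(row[name] for name in names)
        # Assumption: one column yields a scalar, a list of columns a tuple.
        value = values if is_list else values[0]
        # If func is given, it derives the final key from the column value(s).
        return func(value) if func is not None else value

    return key


row = {"city": "Boston", "zip": "02134"}
plain_key = make_key_func("city")(row)
lowered_key = make_key_func("city", str.lower)(row)
compound_key = make_key_func(["city", "zip"])(row)
```

With func omitted, the key is the column value itself; with `str.lower`, rows that differ only in letter case end up with the same key.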

openclean.operator.map.groupby.groupby(df: pandas.core.frame.DataFrame, columns: Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction, List[Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction]]], func: Optional[Union[Callable, openclean.function.value.base.ValueFunction]] = None, having: Optional[Union[Callable, int]] = None) openclean.data.groupby.DataFrameGrouping

Groupby function for data frames. Evaluates a new index based on the rows of the data frame, optionally using the input function. The output is an openclean.data.groupby.DataFrameGrouping object.

Parameters
  • df (pandas.DataFrame) – Input data frame.

  • columns (int, string, openclean.function.eval.base.EvalFunction, or list, default=None) – Single column or evaluation function, or a list of columns or evaluation functions. The column(s)/function(s) are used to generate group-by keys for each row in the input data frame.

  • func (openclean.function.value.base.ValueFunction or callable, default=None) – Optional callable or value function that is used to generate a group-by key from the values that are generated by the columns clause. This is a shortcut for creating an evaluation function with columns as input and func as the evaluated function.

  • having (int or callable, default=None) – If given, only groups that satisfy the condition are returned: (i) if an int is given, groups whose number of rows equals that int; (ii) if a callable is given, groups for which the callable returns True when the group is passed to it as an argument. The callable should expect a pandas data frame and return a boolean.

Return type

openclean.data.groupby.DataFrameGrouping
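The interplay of columns, func, and having can be illustrated with a small stand-in written in plain Python. This is a sketch under stated assumptions, not the openclean implementation: `groupby_sketch` is a hypothetical name, rows are dicts rather than a pandas data frame, only column names are accepted, and the result is an ordinary dict of lists instead of a DataFrameGrouping object.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Optional, Union

Row = Dict[str, Any]


def groupby_sketch(
    rows: List[Row],
    columns: Union[str, List[str]],
    func: Optional[Callable] = None,
    having: Optional[Union[Callable, int]] = None,
) -> Dict[Any, List[Row]]:
    """Group rows by a generated key and filter the groups (illustrative)."""
    is_list = isinstance(columns, list)
    names = columns if is_list else [columns]
    groups: Dict[Any, List[Row]] = defaultdict(list)
    for row in rows:
        values = tuple(row[name] for name in names)
        key = values if is_list else values[0]
        if func is not None:
            # func derives the final key from the column value(s).
            key = func(key)
        groups[key].append(row)

    def keep(group: List[Row]) -> bool:
        if having is None:
            return True  # no filter: keep every group
        if isinstance(having, int):
            return len(group) == having  # row count must equal the int
        return having(group)  # callable decides; expected to return a bool

    return {key: group for key, group in groups.items() if keep(group)}


rows = [
    {"name": "alice", "city": "NYC"},
    {"name": "bob", "city": "nyc"},
    {"name": "carol", "city": "Boston"},
]
all_groups = groupby_sketch(rows, "city", func=str.lower)
pairs = groupby_sketch(rows, "city", func=str.lower, having=2)
singles = groupby_sketch(rows, "city", func=str.lower, having=lambda g: len(g) == 1)
```

In this example, `all_groups` holds two groups keyed `"nyc"` and `"boston"`, `pairs` keeps only the two-row `"nyc"` group, and `singles` keeps only `"boston"`.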