openclean.operator.map.groupby module
Class that implements the DataFrameMapper abstract class to perform groupby operations on a pandas data frame.
- class openclean.operator.map.groupby.GroupBy(columns: Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction, List[Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction]]], func: Optional[Union[Callable, openclean.function.value.base.ValueFunction]] = None)
Bases: openclean.operator.base.DataFrameMapper
GroupBy operator that takes the column(s) to group on and an optional function, performs the groupby, and returns a DataFrameGrouping object.
- map(df: pandas.core.frame.DataFrame) openclean.data.groupby.DataFrameGrouping
Transforms a pandas DataFrame into a DataFrameGrouping object.
- Parameters
df (pandas.DataFrame) – Dataframe to transform using groupby
- Return type
openclean.data.groupby.DataFrameGrouping
- static select(group: pandas.core.frame.DataFrame, condition: Union[Callable, int]) bool
Given a group (a data frame) and a condition, returns a boolean indicating whether the group should be selected.
- Parameters
group (pd.DataFrame) – the group/df under consideration
condition (int or callable) – Selection condition. If not provided, the group is selected. If an int, the group is selected when its number of rows equals that value. If a callable, the group is passed to it; the callable should return a boolean.
- Return type
bool
- Raises
TypeError –
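The selection rules described above can be restated as a minimal sketch. This is not the library's implementation: `select_sketch` is a hypothetical name, a plain list of row dicts stands in for the group's data frame so that the snippet runs without pandas or openclean, and the exact circumstances under which `TypeError` is raised are an assumption (here: a condition of an unsupported type, or a callable that does not return a boolean).

```python
from typing import Callable, List, Optional, Union

Rows = List[dict]  # stand-in for a group's pandas DataFrame


def select_sketch(group: Rows, condition: Optional[Union[Callable, int]] = None) -> bool:
    """Illustrative restatement of the documented selection rules."""
    if condition is None:
        # No condition given: every group is selected.
        return True
    if isinstance(condition, int):
        # Integer condition: compare against the group's number of rows.
        return len(group) == condition
    if callable(condition):
        # Callable condition: pass the whole group and expect a boolean back.
        result = condition(group)
        if not isinstance(result, bool):
            # Assumption: a non-boolean result is treated as a type error.
            raise TypeError("condition is expected to return a boolean")
        return result
    # Assumption: any other condition type is rejected.
    raise TypeError(f"unsupported condition type: {type(condition).__name__}")
```

With a two-row group, `select_sketch(group, 2)` selects the group, `select_sketch(group, 3)` does not, and `select_sketch(group, lambda g: len(g) > 1)` delegates the decision to the callable.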
- openclean.operator.map.groupby.get_eval_func(columns: Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction, List[Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction]]], func: Optional[Union[Callable, openclean.function.value.base.ValueFunction]] = None) openclean.function.eval.base.EvalFunction
Helper function used to create an evaluation function from a key generator specification.
- Parameters
columns (int, string, openclean.function.eval.base.EvalFunction, or list) – Single column or evaluation function, or a list of columns or evaluation functions. The column(s)/function(s) are used to generate group-by keys for each row in the input data frame.
func (openclean.function.value.base.ValueFunction or callable, default=None) – Optional callable or value function that is used to generate a group-by key from the values that are generated by the columns clause. This is a shortcut for creating an evaluation function with columns as input and func as the evaluated function.
- Return type
openclean.function.eval.base.EvalFunction
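The relationship between columns and func can be illustrated with a small sketch. It is not how openclean builds its evaluation functions: `make_key_func` is a hypothetical helper, rows are plain dicts, only string column names are supported, and passing the values of several columns to func as one tuple is an assumption made for this example.

```python
from typing import Callable, List, Optional, Union


def make_key_func(
    columns: Union[str, List[str]],
    func: Optional[Callable] = None,
) -> Callable[[dict], object]:
    """Return a function that computes a group-by key for a single row."""
    single = not isinstance(columns, list)

    def key(row: dict) -> object:
        # Values produced by the columns clause: a scalar for a single
        # column, a tuple for a list of columns (assumption of this sketch).
        value = row[columns] if single else tuple(row[c] for c in columns)
        # The optional func turns those values into the final key.
        return func(value) if func is not None else value

    return key
```

For example, `make_key_func("city", str.lower)` maps a row with city `"NYC"` to the key `"nyc"`, while `make_key_func(["city", "zip"])` uses the pair of column values as the key.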
- openclean.operator.map.groupby.groupby(df: pandas.core.frame.DataFrame, columns: Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction, List[Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction]]], func: Optional[Union[Callable, openclean.function.value.base.ValueFunction]] = None, having: Optional[Union[Callable, int]] = None) openclean.data.groupby.DataFrameGrouping
Groupby function for data frames. Evaluates a new index based on the rows of the data frame, using the optional input function. The output is an openclean.data.groupby.DataFrameGrouping object.
- Parameters
df (pandas.DataFrame) – Input data frame.
columns (int, string, openclean.function.eval.base.EvalFunction, or list) – Single column or evaluation function, or a list of columns or evaluation functions. The column(s)/function(s) are used to generate group-by keys for each row in the input data frame.
func (openclean.function.value.base.ValueFunction or callable, default=None) – Optional callable or value function that is used to generate a group-by key from the values that are generated by the columns clause. This is a shortcut for creating an evaluation function with columns as input and func as the evaluated function.
having (int or callable, default=None) – If given, only groups that satisfy the condition are returned: (i) if an int is given, groups whose number of rows equals that int; (ii) if a callable is given, groups for which the callable returns True when the group is passed to it as an argument. The callable should expect a pandas data frame and return a boolean.
- Return type
openclean.data.groupby.DataFrameGrouping
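The combined effect of columns, func, and having can be sketched in plain Python. This is an illustration of the documented behavior, not the openclean implementation: `groupby_sketch` is a hypothetical function, rows are dicts rather than data frame rows, only a single string column is supported, and the result is an ordinary dict instead of a DataFrameGrouping.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Union


def groupby_sketch(
    rows: List[dict],
    column: str,
    func: Optional[Callable] = None,
    having: Optional[Union[Callable, int]] = None,
) -> Dict[object, List[dict]]:
    """Group rows by a key and keep only groups that satisfy `having`."""
    groups: Dict[object, List[dict]] = defaultdict(list)
    for row in rows:
        value = row[column]
        # The optional func derives the group key from the column value.
        key = func(value) if func is not None else value
        groups[key].append(row)

    def keep(group: List[dict]) -> bool:
        if having is None:
            return True  # no filter: keep every group
        if isinstance(having, int):
            return len(group) == having  # exact row-count match
        return bool(having(group))  # callable decides

    return {k: g for k, g in groups.items() if keep(g)}


rows = [
    {"city": "NYC", "zip": "10001"},
    {"city": "nyc", "zip": "10002"},
    {"city": "Boston", "zip": "02108"},
]
# Groups whose normalized city name occurs more than once.
duplicates = groupby_sketch(rows, "city", func=str.lower, having=lambda g: len(g) > 1)
```

Here `duplicates` contains a single group under the key `"nyc"` with two rows; passing `having=1` without `func` would instead keep the three one-row groups keyed by the raw city values.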