openclean.operator.map.violations module

Operators and helper functions that find violations of functional dependencies or key constraints in a pandas data frame and return them as a DataFrameViolation object.

class openclean.operator.map.violations.Violations(lhs, rhs=None, func=None, having=None)

Bases: openclean.operator.base.DataFrameMapper

Data frame operator that (1) takes the left-hand-side and right-hand-side column names, (2) generates a new key from their values using the func callable, (3) identifies the tuples that violate the specified rule using the having callable, and (4) returns them as a DataFrameViolation object.

map(df: pandas.core.frame.DataFrame) openclean.data.groupby.DataFrameViolation

Identifies violations and maps the pandas DataFrame into a DataFrameViolation object.

Parameters

df (pandas.DataFrame) – Dataframe to find violations in

Return type

openclean.data.groupby.DataFrameViolation

static select(condition: Union[Callable, int], meta: collections.Counter) bool

Given a condition and the meta Counter of a group, returns whether the group should be selected.

Parameters
  • condition (int or callable) – If not provided, the group is selected. If an int, the group's number of rows is checked against the condition. If a callable, the meta Counter is passed to it; the callable should return a boolean.

  • meta (Counter) – the meta Counter for the group/df under consideration

Return type

bool

Raises

TypeError
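
The selection rule described above can be sketched in plain Python. This is a hypothetical stand-in (select_sketch is not part of openclean), and it rests on assumptions the documentation leaves open: that an int condition is compared for equality against the total row count in the Counter, and that TypeError is raised when a callable returns a non-boolean.

```python
from collections import Counter
from typing import Callable, Optional, Union


def select_sketch(condition: Optional[Union[Callable, int]], meta: Counter) -> bool:
    """Hypothetical stand-in for the selection rule; not openclean's code."""
    if condition is None:
        # No condition given: every group is selected.
        return True
    if isinstance(condition, int):
        # Assumption: "number of rows" is the sum of the Counter's counts,
        # and the check is an equality test.
        return sum(meta.values()) == condition
    if callable(condition):
        result = condition(meta)
        if not isinstance(result, bool):
            # Assumption: a non-boolean result is what triggers TypeError.
            raise TypeError("condition must return a boolean")
        return result
    raise TypeError("condition must be an int, a callable, or None")


# A group with three rows and two distinct values.
meta = Counter({"NY": 2, "NJ": 1})

select_sketch(None, meta)                   # no condition
select_sketch(3, meta)                      # int condition
select_sketch(lambda m: len(m) > 1, meta)   # callable condition
```

A callable condition is the more flexible form, since it can inspect the distinct values and their counts rather than only the group size.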

openclean.operator.map.violations.fd_violations(df: pandas.core.frame.DataFrame, lhs: Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction, List[Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction]]], rhs: Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction, List[Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction]]]) openclean.data.groupby.DataFrameViolation

Checks for violations of a functional dependency in the given data frame.

Parameters
  • df (pd.DataFrame) – the input pandas dataframe

  • lhs (int, string, openclean.function.eval.base.EvalFunction, or list) – Generator that forms the determinant key values.

  • rhs (int, string, openclean.function.eval.base.EvalFunction, or list) – Generator that forms the dependent key values.

Return type

openclean.data.groupby.DataFrameViolation
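
To illustrate what a functional dependency violation is, the following minimal sketch uses only the standard library, with a list of dicts standing in for the data frame. The function fd_violations_sketch and the sample rows are hypothetical and do not reproduce openclean's implementation or its DataFrameViolation return type. Rows are grouped by their lhs values, and a group is reported when it contains more than one distinct combination of rhs values.

```python
from collections import Counter, defaultdict

# Hypothetical sample data: zip 07030 maps to two different states.
rows = [
    {"zip": "10001", "city": "New York", "state": "NY"},
    {"zip": "10001", "city": "New York", "state": "NY"},
    {"zip": "07030", "city": "Hoboken", "state": "NJ"},
    {"zip": "07030", "city": "Hoboken", "state": "NY"},
]


def fd_violations_sketch(rows, lhs, rhs):
    """Return {lhs key: Counter of rhs values} for groups violating lhs -> rhs."""
    groups = defaultdict(list)
    for row in rows:
        # The determinant key is the tuple of lhs column values.
        groups[tuple(row[c] for c in lhs)].append(row)

    violations = {}
    for key, members in groups.items():
        # Count each distinct combination of dependent values in the group.
        meta = Counter(tuple(r[c] for c in rhs) for r in members)
        if len(meta) > 1:
            # More than one dependent value for the same determinant.
            violations[key] = meta
    return violations


result = fd_violations_sketch(rows, lhs=["zip"], rhs=["state"])
```

Here result holds a single entry for the key ("07030",), because that zip code appears with both "NJ" and "NY".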

openclean.operator.map.violations.key_violations(df: pandas.core.frame.DataFrame, columns: Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction, List[Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction]]], func: Optional[Union[Callable, openclean.function.value.base.ValueFunction]] = None, n: Optional[int] = -1) openclean.data.groupby.DataFrameViolation

Checks for violations of a key constraint in the given data frame. An optional func can be given as a custom key generator function that operates on the specified columns. The optional parameter n restricts the result to groups with exactly n violations.

Parameters
  • df (pd.DataFrame) – the input pandas dataframe

  • columns (int, string, openclean.function.eval.base.EvalFunction, or list) – Generator to extract group by keys from data frame rows.

  • func (callable or openclean.function.value.base.ValueFunction, default=None) – Optional callable or value function that is used to generate a group-by key from the values that are generated by the columns clause. This is a shortcut for creating an evaluation function with columns as input and func as the evaluated function.

  • n (int, default=-1) – Option to keep only groups with exactly n violations.

Return type

openclean.data.groupby.DataFrameViolation
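
The idea behind a key constraint check can be sketched with the standard library alone. Everything in this example is hypothetical (key_violations_sketch, the sample rows, the lower-casing key function) and it is not openclean's implementation. It assumes that func receives the column values as positional arguments and that n is compared against the number of rows sharing a key. It uses None instead of -1 for "no filter", because the documentation does not spell out how -1 is handled.

```python
from collections import defaultdict

# Hypothetical sample data: "A1" and "a1" differ only in case.
rows = [
    {"id": "A1", "name": "Alice"},
    {"id": "a1", "name": "Alicia"},
    {"id": "B2", "name": "Bob"},
]


def key_violations_sketch(rows, columns, func=None, n=None):
    """Return {key: rows} for keys that are shared by more than one row."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in columns)
        if func is not None:
            # Assumption: the custom key generator receives the column values.
            key = func(*key)
        groups[key].append(row)

    violations = {}
    for key, members in groups.items():
        if len(members) < 2:
            continue  # the key is unique, so there is no violation
        if n is not None and len(members) != n:
            continue  # assumption: n filters on the group's row count
        violations[key] = members
    return violations


exact = key_violations_sketch(rows, ["id"])
folded = key_violations_sketch(rows, ["id"], func=lambda v: v.lower())
```

With exact matching the id column is a valid key and exact is empty. After case folding, "A1" and "a1" collapse to the same key, so folded reports that group as a violation.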