openclean.operator.map.violations module
Functions that detect violations of functional dependencies or key constraints in a pandas DataFrame and return them as a DataFrameViolation object.
- class openclean.operator.map.violations.Violations(lhs, rhs=None, func=None, having=None)
Bases: openclean.operator.base.DataFrameMapper
Violations operator that (1) takes the left-hand-side and right-hand-side column names, (2) generates a new key from the values (func callable), (3) identifies any tuples that violate the specified rules (having callable), and (4) returns them as a DataFrameViolation object.
- map(df: pandas.core.frame.DataFrame) openclean.data.groupby.DataFrameViolation
Identifies violations and maps the pandas DataFrame into a DataFrameViolation object.
- Parameters
df (pandas.DataFrame) – Dataframe to find violations in
- Return type
openclean.data.groupby.DataFrameViolation
- static select(condition: Union[Callable, int], meta: collections.Counter) bool
Given a condition and the meta Counter of a group, returns a bool that indicates whether the group should be selected.
- Parameters
condition (int or callable) – If not provided, the group is selected. If an int, the group's number of rows is checked against the condition. If a callable, the meta is passed to it; the callable should return a boolean.
meta (Counter) – the meta Counter for the group/df under consideration
- Return type
bool
- Raises
TypeError –
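The selection rule described for select can be sketched in plain Python. This is a minimal illustration and not the library's implementation: the name select_group is hypothetical, and three details are assumptions that the text above does not state: that the integer check is an equality test, that the group's number of rows is the sum of the Counter values, and that TypeError is raised for a non-boolean callable result or an unsupported condition type.

```python
from collections import Counter
from typing import Callable, Optional, Union


def select_group(condition: Optional[Union[Callable, int]], meta: Counter) -> bool:
    """Hypothetical sketch of the selection rule (not openclean's code)."""
    if condition is None:
        # No condition provided: the group is always selected.
        return True
    if isinstance(condition, int):
        # Assumption: "number of rows" is the total count held in meta,
        # and the check is an equality test.
        return sum(meta.values()) == condition
    if callable(condition):
        result = condition(meta)
        # Assumption: a non-boolean result is reported as a TypeError.
        if not isinstance(result, bool):
            raise TypeError("condition is expected to return a boolean")
        return result
    # Assumption: any other condition type is rejected.
    raise TypeError("condition must be an int or a callable")


# A group with three rows and two distinct values.
meta = Counter({"NY": 2, "CA": 1})
selected_by_default = select_group(None, meta)
selected_by_size = select_group(3, meta)
selected_by_callable = select_group(lambda m: len(m) > 1, meta)
```

A callable condition is the more flexible form, since it can inspect the distinct values in meta and not only the group size.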
- openclean.operator.map.violations.fd_violations(df: pandas.core.frame.DataFrame, lhs: Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction, List[Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction]]], rhs: Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction, List[Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction]]]) openclean.data.groupby.DataFrameViolation
Checks for violations of a functional dependency in the given data frame.
- Parameters
df (pd.DataFrame) – the input pandas dataframe
lhs (int, string, openclean.function.eval.base.EvalFunction, or list) – Generator that forms the determinant key values.
rhs (list or str) – Generator that forms the dependent key values.
- Return type
openclean.data.groupby.DataFrameViolation
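To make the idea of a functional dependency violation concrete, the following sketch checks lhs → rhs over a list of dict rows using only the standard library. It does not call openclean or pandas; the function find_fd_violations and the sample data are hypothetical, and the rule that a group is a violation when its determinant key maps to more than one distinct dependent value is the textbook definition, assumed here and not taken from the library's source.

```python
from collections import Counter, defaultdict


def find_fd_violations(rows, lhs, rhs):
    """Hypothetical sketch: group rows by the lhs key and count rhs values."""
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[col] for col in lhs)      # determinant key
        value = tuple(row[col] for col in rhs)    # dependent value
        groups[key][value] += 1
    # A key violates the dependency if it maps to more than one rhs value.
    return {key: counts for key, counts in groups.items() if len(counts) > 1}


# Illustrative data: zip 10001 maps to two different city spellings.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "94105", "city": "San Francisco"},
    {"zip": "94105", "city": "San Francisco"},
]
violations = find_fd_violations(rows, lhs=["zip"], rhs=["city"])
```

Here only the key ("10001",) is reported, because "94105" always maps to the same city.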
- openclean.operator.map.violations.key_violations(df: pandas.core.frame.DataFrame, columns: Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction, List[Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction]]], func: Optional[Union[Callable, openclean.function.value.base.ValueFunction]] = None, n: Optional[int] = -1) openclean.data.groupby.DataFrameViolation
Checks for violations of a key constraint in the given data frame. An optional func can be given as a custom key generator function that operates on the specified columns. The optional parameter n can be used to select only groups with exactly n violations.
- Parameters
df (pd.DataFrame) – the input pandas dataframe
columns (int, string, openclean.function.eval.base.EvalFunction, or list) – Generator to extract group by keys from data frame rows.
func (callable or openclean.function.value.base.ValueFunction, default=None) – Optional callable or value function that is used to generate a group-by key from the values that are generated by the columns clause. This is a shortcut to creating an evaluation function with columns as input and func as the evaluated function.
n (int, default=-1) – Option to filter out groups that do not have exactly n violations.
- Return type
openclean.data.groupby.DataFrameViolation
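The key-constraint check with a custom key generator can be illustrated the same way. The sketch below is standard-library Python and not openclean's implementation; find_key_violations, normalize, and the sample rows are hypothetical. It assumes that func receives the column values as positional arguments, that a violation is a key shared by more than one row, that "n violations" means a group of exactly n rows, and that the default n=-1 disables the size filter.

```python
from collections import defaultdict


def find_key_violations(rows, columns, func=None, n=-1):
    """Hypothetical sketch: report keys that occur in more than one row."""
    groups = defaultdict(list)
    for row in rows:
        values = tuple(row[col] for col in columns)
        # Assumption: func gets the column values as positional arguments.
        key = func(*values) if func is not None else values
        groups[key].append(row)
    # A key constraint is violated when a key identifies several rows.
    duplicates = {k: g for k, g in groups.items() if len(g) > 1}
    if n >= 0:
        # Assumption: n keeps only groups with exactly n rows.
        duplicates = {k: g for k, g in duplicates.items() if len(g) == n}
    return duplicates


def normalize(name):
    """Illustrative key generator: ignore case and surrounding whitespace."""
    return name.strip().lower()


rows = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": " alice"},
    {"id": 3, "name": "Bob"},
]
raw_violations = find_key_violations(rows, ["name"])
normalized_violations = find_key_violations(rows, ["name"], func=normalize)
```

Without func, the three names are distinct and no violation is found; with normalize, the first two rows collapse to the same key and are reported together.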