openclean.operator.collector.repair module

Repair function for groups of rows that represent constraint violations.

class openclean.operator.collector.repair.ConflictRepair(strategy: Dict[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]], openclean.function.value.base.ValueFunction], in_order: Optional[bool] = True)

Bases: openclean.operator.base.DataGroupReducer

The conflict repair function resolves conflicts in data frames (groups) that contain sets of rows that together represent a single violation of a functional dependency constraint. The function resolves conflicts by consolidating values in the (conflicting) data frame columns using a given set of conflict resolution functions (strategy).

reduce(groups: openclean.data.groupby.DataFrameGrouping) → pandas.core.frame.DataFrame

The conflict resolution functions are applied on the respective attribute for each data frame (group) that represents a constraint violation. The modified rows are merged with the remaining (non-conflicting) rows in the data frame that was used for violation detection. The resulting data frame is returned as the result of the repair function.

The in_order flag determines the algorithm variant that is used to modify the given data frame. If in_order is True, the rows in the resulting data frame are in the same order as in the input data frame. This is achieved by creating a copy of the data frame and updating rows in place. If in_order is False, the rows for the updated groups are appended to a data frame that initially contains only the non-conflicting rows.

Parameters: groups (openclean.data.groupby.DataFrameGrouping) – Grouping of pandas data frames.
Return type: pd.DataFrame
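
The in-order variant (copy the data frame, then update rows in place) can be illustrated with plain pandas. This is only a sketch of the idea, not the openclean implementation: the helper repair_in_order is hypothetical, each group is represented as a list of row index labels, and built-in callables such as max stand in for ValueFunction objects.

```python
import pandas as pd


def repair_in_order(df, groups, strategy):
    """Hypothetical helper: copy df and overwrite conflicting cells in place."""
    result = df.copy()
    for row_ids in groups:
        for column, resolve in strategy.items():
            # Consolidate the conflicting values of this group into one value.
            value = resolve(list(result.loc[row_ids, column]))
            # Write the value back to the same rows; the row order is untouched.
            result.loc[row_ids, column] = value
    return result


df = pd.DataFrame({
    "zip": ["10001", "10002", "10001", "10003"],
    "city": ["New York", "Brooklyn", "NYC", "Queens"],
})
# Rows 0 and 2 share a zip code but disagree on the city.
groups = [[0, 2]]
repaired = repair_in_order(df, groups, {"city": max})
```

Because only cell values are overwritten in the copy, the index of repaired matches the index of df and the input data frame itself is left unmodified.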

class openclean.operator.collector.repair.ValueExtractor(strategy: Dict[int, openclean.function.value.base.ValueFunction])

Bases: openclean.function.eval.base.EvalFunction

Helper class that extracts and manipulates column values using a list of prepared value functions.

eval(df: pandas.core.frame.DataFrame) → Union[pandas.core.series.Series, List[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]]]

Evaluate the value functions on the values of their respective columns. The column values are extracted from the given data frame. The respective value function is then prepared using the column values (if necessary). Finally, the column values are modified using the prepared value function.

Parameters: df (pd.DataFrame) – Pandas data frame.
Return type: pd.Series or list
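
The three steps described for eval (extract, prepare, modify) can be sketched as follows. The names prepare_vote and eval_columns are hypothetical, the "prepare" step is modelled as a factory that receives the column values and returns a per-value callable, and the returned dictionary is an assumption chosen for readability; it is not the return shape of the actual method.

```python
from collections import Counter

import pandas as pd


def prepare_vote(values):
    """Hypothetical prepare step: return a function that maps every value
    to the most frequent value in the column."""
    winner = Counter(values).most_common(1)[0][0]
    return lambda value: winner


def eval_columns(df, strategy):
    """Hypothetical sketch: apply one function per column index position."""
    result = {}
    for position, prepare in strategy.items():
        values = list(df.iloc[:, position])            # extract column values
        func = prepare(values)                         # prepare the function
        result[position] = [func(v) for v in values]   # modify the values
    return result


group = pd.DataFrame({
    "zip": ["10001", "10001", "10001"],
    "city": ["New York", "NYC", "New York"],
})
modified = eval_columns(group, {1: prepare_vote})
```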

prepare(columns: List[Union[str, histore.document.schema.Column]]) → Callable[[List[Union[int, float, str, datetime.datetime]]], Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]]

The function mapping already contains references to columns by their index position. There is nothing to prepare. We raise an error because the ValueExtractor is not intended to be used as a stream function at this point.

Parameters: columns (list of string) – List of column names in the schema of the data stream.
Return type: openclean.data.stream.base.StreamFunction

openclean.operator.collector.repair.conflict_repair(conflicts: openclean.data.groupby.DataFrameGrouping, strategy: Dict[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]], openclean.function.value.base.ValueFunction], in_order: Optional[bool] = True) → pandas.core.frame.DataFrame

The conflict repair function resolves conflicts in data frames (groups) that contain sets of rows that together represent a single violation of a functional dependency constraint. The function resolves conflicts by consolidating values in the (conflicting) data frame columns using a given set of conflict resolution functions (strategy).

The idea is that the user specifies a conflict resolution function for each attribute that has multiple values which form a violation of a constraint (e.g., a functional dependency). The conflict resolution strategy is defined as a mapping of column names (or index positions) to value functions for conflict resolution. The user decides for which columns to provide conflict resolution functions.
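
Since a strategy may be keyed by column names or by index positions, one plausible first step is to normalize all keys to positions. The sketch below shows that idea only; to_positions is a hypothetical helper, and the built-in max and min stand in for ValueFunction objects. How openclean performs this step internally is not stated here.

```python
def to_positions(columns, strategy):
    """Hypothetical helper: map column names or positions to positions."""
    positions = {}
    for key, func in strategy.items():
        # Integer keys are taken as index positions; other keys are
        # looked up by name in the list of column names.
        index = key if isinstance(key, int) else columns.index(key)
        positions[index] = func
    return positions


columns = ["zip", "city", "state"]
# Only "city" and the column at position 2 get a resolution function;
# "zip" is not part of the strategy and would be left unchanged.
strategy = {"city": max, 2: min}
by_position = to_positions(columns, strategy)
```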

The conflict resolution functions are applied on the respective attribute for each data frame (group) that represents a constraint violation. The modified rows are merged with the remaining (non-conflicting) rows in the data frame that was used for violation detection. The resulting data frame is returned as the result of the repair function.

The in_order flag determines the algorithm variant that is used to modify the given data frame. If in_order is True, the rows in the resulting data frame are in the same order as in the input data frame. This is achieved by creating a copy of the data frame and updating rows in place. If in_order is False, the rows for the updated groups are appended to a data frame that initially contains only the non-conflicting rows. Therefore, the rows in the resulting data frame are likely to be in a different order than in the input data frame.
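
The append variant (in_order set to False) and its effect on row order can be sketched in plain pandas. As before, this is an illustration under assumptions rather than the library code: repair_appending is a hypothetical helper, groups are lists of row index labels, and max stands in for a ValueFunction.

```python
import pandas as pd


def repair_appending(df, groups, strategy):
    """Hypothetical helper: start from the non-conflicting rows and
    append each repaired group."""
    conflicting = [i for row_ids in groups for i in row_ids]
    parts = [df.drop(index=conflicting)]
    for row_ids in groups:
        group = df.loc[row_ids].copy()
        for column, resolve in strategy.items():
            # Replace all values in the column by the consolidated value.
            group[column] = resolve(list(group[column]))
        parts.append(group)
    return pd.concat(parts)


df = pd.DataFrame({
    "zip": ["10001", "10002", "10001", "10003"],
    "city": ["New York", "Brooklyn", "NYC", "Queens"],
})
appended = repair_appending(df, [[0, 2]], {"city": max})
```

In this sketch the repaired rows 0 and 2 end up after the untouched rows 1 and 3, which is why the input order is not preserved.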

Parameters

conflicts (openclean.data.groupby.DataFrameGrouping) – Grouping of rows from a data frame. Each group represents a set of rows that form a violation of a checked integrity constraint.
strategy (dict) – Mapping of column names or index positions to conflict resolution functions.
in_order (bool, default=True) – If True, the resulting data frame is guaranteed to have its rows in the same order as the input data frame. If False, no such guarantee is made.

Return type

pd.DataFrame