openclean.data.util module

Helper functions to transform data frames into lists or mappings and vice versa.

openclean.data.util.get_value(row: Union[List, Tuple], columns: List[int]) → Union[int, float, str, datetime.datetime, Tuple]

Helper function to get the value for a single column or for multiple columns from a data frame row. If columns contains only a single column index, the value at that index position in the given row is returned. If columns contains multiple column indices, a tuple with the row values for all the specified columns is returned.

Parameters
  • row (list or tuple of scalar values) – Row in a data frame.

  • columns (list of int) – List of index positions for extracted column values.

Return type

scalar or tuple of scalar
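
The single-column versus multi-column behavior described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation; the name get_value_sketch and the sample row are hypothetical.

```python
from typing import List, Tuple, Union


def get_value_sketch(row: Union[List, Tuple], columns: List[int]):
    """Hypothetical stand-in that mirrors the documented behavior."""
    if len(columns) == 1:
        # Single column index: return the scalar at that position.
        return row[columns[0]]
    # Multiple column indices: return a tuple of the selected values,
    # in the order in which the indices are listed.
    return tuple(row[i] for i in columns)


# Made-up sample row for illustration.
row = ("Alice", 23, "NYC")
single = get_value_sketch(row, [1])    # scalar value
multi = get_value_sketch(row, [0, 2])  # tuple of values
```

Here single is the scalar 23, while multi is the tuple ("Alice", "NYC").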

openclean.data.util.repair_mapping(df: pandas.core.frame.DataFrame, key: Union[int, str, List[Union[str, int]]], value: Union[int, str, List[Union[str, int]]]) → Dict

Create a lookup table from the given data frame that represents a repair mapping for a given combination of lookup key and target value. The key columns and value columns represent the columns from which the lookup key and the mapping target value are generated.

The resulting mapping is a dictionary that contains entries for all key values that were mapped to target values that are different from the key value.

The function will raise an error if no unique mapping can be defined from the values in the given data frame.

Parameters
  • df (pd.DataFrame) – Pandas data frame.

  • key (Columns) – Single column or list of column names or index positions. The specified column(s) are used to generate the mapping key value.

  • value (Columns) – Single column or list of column names or index positions. The specified column(s) are used to generate the mapping target value.

Return type

dict
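
The idea of a repair mapping can be sketched without pandas by treating each data frame row as a tuple. This is an illustration under stated assumptions, not openclean's code: the names repair_mapping_sketch and _pick are hypothetical, columns are given as index positions only, "no unique mapping" is interpreted as one key appearing with two different target values, and ValueError is an assumed error type.

```python
from typing import Dict, Iterable, List, Sequence


def _pick(row: Sequence, columns: List[int]):
    """Return a scalar for one column index, a tuple for several."""
    if len(columns) == 1:
        return row[columns[0]]
    return tuple(row[i] for i in columns)


def repair_mapping_sketch(
    rows: Iterable[Sequence], key: List[int], value: List[int]
) -> Dict:
    """Hypothetical sketch: map keys to targets that differ from the key."""
    seen = {}     # every key encountered so far, with its target value
    mapping = {}  # only the entries where key and target differ
    for row in rows:
        k = _pick(row, key)
        v = _pick(row, value)
        if k in seen and seen[k] != v:
            # Assumption: conflicting targets mean the mapping is not unique.
            raise ValueError(f"no unique mapping for key {k!r}")
        seen[k] = v
        if k != v:
            mapping[k] = v
    return mapping


# Made-up rows: (original spelling, standardized spelling).
rows = [
    ("Brooklyn", "BROOKLYN"),
    ("brooklyn", "BROOKLYN"),
    ("BROOKLYN", "BROOKLYN"),  # key equals target, so it is left out
    ("Queens", "QUEENS"),
]
mapping = repair_mapping_sketch(rows, key=[0], value=[1])
```

The resulting dictionary has entries for "Brooklyn", "brooklyn", and "Queens" but none for "BROOKLYN", because that key already equals its target value.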

openclean.data.util.to_set(data)

Create a set of distinct values (rows) for a given data frame or data series. For data frames with multiple columns, each row is converted into a tuple that is added to the set.

Parameters

data (pandas.DataFrame or pandas.Series) – Input data frame or data series.

Return type

set

Raises

ValueError
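
The conversion of rows into a set of distinct values can be sketched in plain Python, with a list of tuples standing in for a data frame and a list of scalars standing in for a data series. The name to_set_sketch is hypothetical, and unwrapping single-column rows to scalars is an assumption rather than documented behavior.

```python
def to_set_sketch(rows):
    """Hypothetical sketch: collect distinct rows or values in a set."""
    result = set()
    for row in rows:
        if isinstance(row, (list, tuple)):
            if len(row) == 1:
                # Assumption: a single-column row contributes its scalar.
                result.add(row[0])
            else:
                # Multiple columns: the row becomes a hashable tuple.
                result.add(tuple(row))
        else:
            # Series-like input: values are added as they are.
            result.add(row)
    return result


# Made-up inputs; the duplicate row and duplicate value collapse.
frame_rows = [("a", 1), ("b", 2), ("a", 1)]
series_values = [3, 3, 4]
distinct_rows = to_set_sketch(frame_rows)
distinct_values = to_set_sketch(series_values)
```

With these inputs, distinct_rows holds the two tuples ("a", 1) and ("b", 2), and distinct_values holds 3 and 4.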