openclean.data.groupby module

Base class for data frame groupings. A data frame grouping splits a data frame into multiple (potentially overlapping) data frames with the same schema.

class openclean.data.groupby.ConflictSummary

Bases: collections.defaultdict

Summarize conflicts in one or more attributes for groups in a data frame grouping.

add(values: List[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]])

Add a list of conflicting values from a data frame group.

Parameters: values (list of scalar or tuple) – List of conflicting values.

most_common(n: Optional[int] = 10) → List[Tuple[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]], int]]

Ranking of the n most common values in conflicts.

Parameters: n (int, default=10) – Number of values to include in the ranking.
Return type: list of value and count pairs
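
The bookkeeping described for add() and most_common() can be sketched with the standard library alone. The class below is a hypothetical stand-in, not the openclean implementation: the name SketchConflictSummary and the dictionary-based record layout are assumptions made for illustration.

```python
from collections import Counter, defaultdict


class SketchConflictSummary(defaultdict):
    """Hypothetical stand-in; not the openclean implementation."""

    def __init__(self):
        # Each value maps to a record: the number of conflicts it appeared
        # in and a counter of the values it conflicted with (assumed layout).
        super().__init__(lambda: {"count": 0, "values": Counter()})

    def add(self, values):
        """Register one list of conflicting values from a single group."""
        for value in values:
            record = self[value]
            record["count"] += 1
            for other in values:
                if other != value:
                    record["values"][other] += 1

    def most_common(self, n=10):
        """Rank values by the number of conflicts they appeared in."""
        ranking = Counter({value: rec["count"] for value, rec in self.items()})
        return ranking.most_common(n)


summary = SketchConflictSummary()
summary.add(["NY", "N.Y."])
summary.add(["NY", "New York", "N.Y."])
summary.add(["CA", "Calif."])
top = summary.most_common(2)
```

Here top holds the two values that took part in the most conflicts, each paired with its count.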

class openclean.data.groupby.DataFrameGrouping(df: pandas.core.frame.DataFrame)

Bases: object

A data frame grouping is a mapping of key values to subsets of rows for a given data frame.

Internally, this class contains a data frame and a mapping of key values to lists of row indices for the rows in each group. There are currently no restrictions on the number of groups that each of the original data frame rows can occur in.

The grouping provides a basic set of methods to access the individual data frames that represent the different groups.
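
As a rough illustration of the internal structure described above, the following sketch models the data frame as a plain list of rows and the grouping as a mapping from keys to row positions. All names (rows, groups, group_rows) are hypothetical, and a list of tuples stands in for the pandas data frame.

```python
# Hypothetical stand-in: a table as a list of rows plus a key -> positions map.
columns = ["name", "city"]
rows = [
    ("Alice", "Boston"),
    ("Bob", "Chicago"),
    ("Carol", "Boston"),
]

# Row 0 occurs in two groups; the description above allows this overlap.
groups = {
    "Boston": [0, 2],
    "starts-with-A": [0],
}


def group_rows(key):
    """Return the rows of one group, or None for an unknown key."""
    positions = groups.get(key)
    if positions is None:
        return None
    return [rows[i] for i in positions]


boston = group_rows("Boston")
```

Every group shares the schema of the original table because it is assembled from the same rows.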

add(key: str, rows: List[int]) → openclean.data.groupby.DataFrameGrouping

Add a new group to the collection. Raises a ValueError if a group with the given key already exists. Returns a reference to this object instance.

Parameters

key (scalar or tuple) – Key value generated by the GroupBy operator for the rows in the data frame.
rows (list(int)) – List of indices for rows in the original data frame that are part of the added group. Note that each index is the position of the row in the data frame (i.e., the index into the array of rows), not the value of the data frame index for that row.

Return type

openclean.data.groupby.DataFrameGrouping

Raises

ValueError – If a group with the given key already exists.
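
The contract of add() (reject duplicate keys, return the instance itself) can be sketched as follows. SketchGrouping is a hypothetical stand-in written for this example; only the behavior stated above is taken from the documentation.

```python
class SketchGrouping:
    """Hypothetical stand-in for the add() contract described above."""

    def __init__(self):
        self._groups = {}

    def add(self, key, rows):
        if key in self._groups:
            raise ValueError("duplicate key '{}'".format(key))
        self._groups[key] = list(rows)
        return self  # returning the instance allows chained calls


# Chained calls work because each add() returns the grouping itself.
grouping = SketchGrouping().add("A", [0, 1]).add("B", [1, 2])

try:
    grouping.add("A", [3])
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```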

property columns: List[str]

Get the names of columns in the schema of the grouped data frame.

Return type: list of string

get(key: str) → pandas.core.frame.DataFrame

Get the data frame that is associated with the given key. Returns None if the given key does not exist in the grouping.

Parameters: key (scalar or tuple) – Key value generated by the GroupBy operator for the rows in the data frame.
Return type: pd.DataFrame

groups() → Iterator[Tuple[str, pandas.core.frame.DataFrame]]

Synonym for items(). Iterates over the groups (and their associated keys) in this grouping.

Return type: (scalar or tuple, pd.DataFrame)

items() → Iterator[Tuple[str, pandas.core.frame.DataFrame]]

Iterate over the groups in this grouping. Returns pairs of group key and the associated data frame containing the rows from the original data frame that are in this group.

Return type: (scalar or tuple, pd.DataFrame)

keys() → Set[str]

Get the set of group keys.

Return type: set

rows(key: str) → List[int]

Get the indices of the rows that are associated with the given key. Returns None if the key doesn’t exist.

Parameters: key (scalar or tuple) – Key value generated by the GroupBy operator for the rows in the data frame.
Return type: list

values(key: str, columns: Union[int, str, List[Union[str, int]]]) → collections.Counter

Get values (and their frequency counts) for columns of rows in the group that is identified by the given key.

Parameters

key (scalar or tuple) – Key value generated by the GroupBy operator for the rows in the data frame.
columns (int, string, or list(int or string)) – Single column or list of column index positions or column names.

Return type

collections.Counter
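
A minimal sketch of the counting that values() describes, using a list of tuples in place of the data frame. The helper count_values is hypothetical, and the choice to count tuples when several columns are given is an assumption of this sketch.

```python
from collections import Counter

columns = ["name", "city", "state"]
rows = [
    ("Alice", "Boston", "MA"),
    ("Bob", "Boston", "Mass."),
    ("Carol", "Boston", "MA"),
]
group = [0, 1, 2]  # row positions of a single group


def count_values(positions, cols):
    """Count values of the selected column(s) over the rows of one group."""
    if isinstance(cols, (int, str)):
        # Single column: count scalar values.
        idx = cols if isinstance(cols, int) else columns.index(cols)
        return Counter(rows[i][idx] for i in positions)
    # List of columns: count tuples of values (an assumption of this sketch).
    idxs = [c if isinstance(c, int) else columns.index(c) for c in cols]
    return Counter(tuple(rows[i][j] for j in idxs) for i in positions)


state_counts = count_values(group, "state")
pair_counts = count_values(group, ["city", 2])
```

Columns may be given by name or by position, mirroring the parameter description above.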

class openclean.data.groupby.DataFrameViolation(df: pandas.core.frame.DataFrame)

Bases: openclean.data.groupby.DataFrameGrouping

Subclass of DataFrameGrouping that maintains additional metadata related to a violation.

add(key: str, rows: List[int], meta: Optional[collections.Counter] = None) → openclean.data.groupby.DataFrameViolation

Add a new group to the collection. Stores the metadata counter for the key in self._meta and the row indices for the key in self._groups.

Parameters

key (str) – Key for the group.
rows (list) – List of row indices for the group.
meta (Counter, optional) – Metadata counters for the group.

Return type

openclean.data.groupby.DataFrameViolation

conflicts(key: str, columns: Union[int, str, List[Union[str, int]]]) → collections.Counter

Synonym for values(). Get the values (and their frequency counts) for columns of rows in the group that is identified by the given key.

Parameters

key (scalar or tuple) – Key value generated by the GroupBy operator for the rows in the data frame.
columns (int, string, or list(int or string)) – Single column or list of column index positions or column names.

Return type

collections.Counter

get_meta(key: str) → collections.Counter

Returns the metadata counter for the given key.

Parameters: key (str) – Key of the data frame group.
Return type: collections.Counter

summarize_conflicts(columns: Union[int, str, List[Union[str, int]]]) → openclean.data.groupby.ConflictSummary

Get a summary of conflicting values in one or more attributes within the individual groups in the grouping.

A conflict is defined as a set of multiple values that occur in the specified column(s) within a group in this grouping. For each value that occurs in a conflict the summary maintains (i) the number of groups where the value appeared in a conflict, and (ii) a list of conflicting values with a count for the number of groups that these values conflicted in.

Parameters: columns (int, string, or list(int or string)) – Single column or list of column index positions or column names.
Return type: openclean.data.groupby.ConflictSummary
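
The definition of a conflict given above can be illustrated with a short scan over groups. This sketch covers only part (i) of the summary, the number of groups in which each value appeared in a conflict. The input layout, a mapping from group key to the column values of the group's rows, is a hypothetical simplification.

```python
from collections import Counter

# Hypothetical groups: key -> values of one column for the rows in the group.
groups = {
    "10001": ["NY", "NY", "N.Y."],
    "60601": ["IL", "IL"],
    "02108": ["MA", "Mass.", "MA"],
    "10002": ["NY", "New York"],
}

groups_in_conflict = Counter()
for key, values in groups.items():
    distinct = set(values)
    if len(distinct) > 1:
        # More than one distinct value in the group is a conflict.
        groups_in_conflict.update(distinct)
```

The group "60601" contributes nothing because it contains a single distinct value.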

class openclean.data.groupby.ValueConflicts(count: int = 0, values: collections.Counter = <factory>)

Bases: object

Information about the number of groups in which a value occurs as a conflicting value, and about the values that it conflicts with.

count: int = 0

values: collections.Counter
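
The <factory> default in the signature suggests a dataclass field with a default factory. That is an inference from the signature, not something stated in this reference. A hypothetical equivalent, under that assumption, might look like this:

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SketchValueConflicts:
    """Hypothetical stand-in mirroring the two documented fields."""

    count: int = 0
    # default_factory gives every instance its own Counter.
    values: Counter = field(default_factory=Counter)


a = SketchValueConflicts()
b = SketchValueConflicts()
a.count += 1
a.values["N.Y."] += 1
```

Updating a leaves b untouched, since the two instances do not share a counter.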