openclean.data.groupby module

Base class for data frame groupings. A data frame grouping splits a data frame into multiple (potentially overlapping) data frames with the same schema.

class openclean.data.groupby.ConflictSummary

Bases: collections.defaultdict

Summarize conflicts in one or more attributes for groups in a data frame grouping.

add(values: List[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]])

Add a list of conflicting values from a data frame group.

Parameters

values (list of scalar or tuple) – List of conflicting values.

most_common(n: Optional[int] = 10) List[Tuple[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]], int]]

Ranking of the n most common values in conflicts.

Parameters

n (int, default=10) – Number of values to include in the ranking.

Return type

list of value and count pairs
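The following is a minimal sketch of how a defaultdict-based summary with these two methods could behave. It is not the openclean implementation: the names ConflictEntry and ConflictSummarySketch are hypothetical, and the assumption that each call to add() represents one group is taken from the description of summarize_conflicts() in this module.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass
class ConflictEntry:
    """Hypothetical per-value record (assumption, not the library class)."""
    count: int = 0                                    # groups the value conflicted in
    values: Counter = field(default_factory=Counter)  # values it conflicted with


class ConflictSummarySketch(defaultdict):
    """Illustrative stand-in for a conflict summary."""

    def __init__(self):
        # Missing keys get a fresh, empty ConflictEntry.
        super().__init__(ConflictEntry)

    def add(self, values):
        # One call is assumed to carry the conflicting values of one group.
        for value in values:
            entry = self[value]
            entry.count += 1
            for other in values:
                if other != value:
                    entry.values[other] += 1

    def most_common(self, n=10):
        # Rank values by the number of groups in which they conflicted.
        ranking = Counter({value: entry.count for value, entry in self.items()})
        return ranking.most_common(n)


summary = ConflictSummarySketch()
summary.add(["NY", "New York"])
summary.add(["NY", "N.Y."])
top = summary.most_common(1)
print(top)  # [('NY', 2)]
```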

class openclean.data.groupby.DataFrameGrouping(df: pandas.core.frame.DataFrame)

Bases: object

A data frame grouping is a mapping of key values to subsets of rows for a given data frame.

Internally, this class contains a data frame and a mapping of key values to lists of row indices for the rows in each group. There are currently no restrictions on the number of groups that each of the original data frame rows can occur in.

The grouping provides a basic set of methods to access the individual data frames that represent the different groups.
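As a rough illustration of the structure described above (a data frame plus a mapping from keys to lists of row positions), the sketch below uses a hypothetical GroupingSketch class. It only mirrors the documented behavior of add(), get(), and items(); the actual openclean class may be implemented differently.

```python
import pandas as pd


class GroupingSketch:
    """Hypothetical stand-in: a data frame and a key -> row positions map."""

    def __init__(self, df):
        self.df = df
        self._groups = {}

    def add(self, key, rows):
        # Duplicate keys are rejected, as documented for add().
        if key in self._groups:
            raise ValueError(f"duplicate key {key!r}")
        self._groups[key] = rows
        return self  # allows chaining

    def get(self, key):
        # Unknown keys yield None rather than raising.
        rows = self._groups.get(key)
        return None if rows is None else self.df.iloc[rows]

    def items(self):
        for key, rows in self._groups.items():
            yield key, self.df.iloc[rows]


df = pd.DataFrame({
    "city": ["NY", "Boston", "NY"],
    "zip": ["10001", "02108", "10002"],
})

# Groups may overlap: rows 0 and 2 belong to both groups.
groups = GroupingSketch(df).add("NY", [0, 2]).add("all", [0, 1, 2])
```

Every group data frame is built from the same source frame, so all groups share its schema.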

add(key: str, rows: List[int]) openclean.data.groupby.DataFrameGrouping

Add a new group to the collection. Raises a ValueError if a group with the given key already exists. Returns a reference to this object instance.

Parameters
  • key (scalar or tuple) – Key value generated by the GroupBy operator for the rows in the data frame.

  • rows (list(int)) – List of positions of the rows in the original data frame that are part of the added group. These are positions in the array of rows, not values from the data frame's index.

Return type

openclean.data.groupby.DataFrameGrouping

Raises

ValueError
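The distinction between row positions and index values for the rows parameter can be seen with plain pandas. The data frame below is made up for illustration; only standard pandas calls (iloc and index) are used.

```python
import pandas as pd

# A data frame whose index labels differ from the row positions.
df = pd.DataFrame({"name": ["a", "b", "c"]}, index=[10, 20, 30])

# Row positions, as expected by the rows parameter.
positions = [0, 2]

# iloc selects by position, so this picks the first and third rows.
by_position = df.iloc[positions]

# The index labels of the selected rows are different values.
labels = by_position.index.tolist()
print(labels)  # [10, 30]
```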

property columns: List[str]

Get the names of columns in the schema of the grouped data frame.

Return type

list of string

get(key: str) pandas.core.frame.DataFrame

Get the data frame that is associated with the given key. Returns None if the given key does not exist in the grouping.

Parameters

key (scalar or tuple) – Key value generated by the GroupBy operator for the rows in the data frame.

Return type

pd.DataFrame

groups() Iterator[Tuple[str, pandas.core.frame.DataFrame]]

Synonym for items(). Iterates over the groups (and their associated keys) in this grouping.

Return type

(scalar or tuple, pd.DataFrame)

items() Iterator[Tuple[str, pandas.core.frame.DataFrame]]

Iterate over the groups in this grouping. Returns pairs of group key and the associated data frame containing the rows from the original data frame that are in this group.

Return type

(scalar or tuple, pd.DataFrame)

keys() Set[str]

Get the set of group keys.

Return type

set

rows(key: str) List[int]

Get the row indices associated with the given key. Returns None if the key does not exist.

Parameters

key (scalar or tuple) – Key value generated by the GroupBy operator for the rows in the data frame.

Return type

list

values(key: str, columns: Union[int, str, List[Union[str, int]]]) collections.Counter

Get values (and their frequency counts) for columns of rows in the group that is identified by the given key.

Parameters
  • key (scalar or tuple) – Key value generated by the GroupBy operator for the rows in the data frame.

  • columns (int, string, or list(int or string)) – Single column or list of column index positions or column names.

Return type

collections.Counter
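A sketch of how such a frequency count could be computed with pandas and collections.Counter is shown below. The helper group_values is hypothetical, and returning tuples for a list of columns is an assumption suggested by the tuple types in the ConflictSummary signatures, not confirmed library behavior.

```python
from collections import Counter

import pandas as pd


def group_values(df, rows, columns):
    """Count values in the given column(s) for the rows at the given positions."""

    def resolve(col):
        # Accept either a column position (int) or a column name.
        return df.columns[col] if isinstance(col, int) else col

    subset = df.iloc[rows]
    if isinstance(columns, list):
        names = [resolve(col) for col in columns]
        # One plain tuple per row for multi-column counts (assumption).
        return Counter(subset[names].itertuples(index=False, name=None))
    return Counter(subset[resolve(columns)])


df = pd.DataFrame({
    "city": ["NY", "NY", "Boston", "NY"],
    "state": ["NY", "NY", "MA", "N.Y."],
})

single = group_values(df, [0, 1, 3], "state")
multi = group_values(df, [0, 1, 3], ["city", "state"])
```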

class openclean.data.groupby.DataFrameViolation(df: pandas.core.frame.DataFrame)

Bases: openclean.data.groupby.DataFrameGrouping

Subclass of DataFrameGrouping that maintains additional meta data related to a violation for each group.

add(key: str, rows: List[int], meta: Optional[collections.Counter] = None) openclean.data.groupby.DataFrameViolation

Add a new group to the collection. Stores the row indices for the key in self._groups and the optional meta data counter for the key in self._meta.

Parameters
  • key (str) – key for the group

  • rows (list) – list of indices for the group

  • meta (Counter (Optional)) – meta data counters for the group

Return type

openclean.data.groupby.DataFrameViolation

conflicts(key: str, columns: Union[int, str, List[Union[str, int]]]) collections.Counter

Synonym for values(). Get the values (and their frequency counts) from the given columns in the rows of the group that is identified by the given key.

Parameters
  • key (scalar or tuple) – Key value generated by the GroupBy operator for the rows in the data frame.

  • columns (int, string, or list(int or string)) – Single column or list of column index positions or column names.

Return type

collections.Counter

get_meta(key: str) collections.Counter

Returns the meta data counter for the given key.

Parameters

key (str) – the key for the dataframe group

Return type

collections.Counter

summarize_conflicts(columns: Union[int, str, List[Union[str, int]]]) openclean.data.groupby.ConflictSummary

Get a summary of conflicting values in one or more attributes within the individual groups in the grouping.

A conflict is defined as a set of multiple values that occur in the specified column(s) within a group in this grouping. For each value that occurs in a conflict the summary maintains (i) the number of groups where the value appeared in a conflict, and (ii) a list of conflicting values with a count for the number of groups that these values conflicted in.

Parameters

columns (int, string, or list(int or string)) – Single column or list of column index positions or column names.

Return type

openclean.data.groupby.ConflictSummary
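The conflict definition above can be sketched in plain Python. The function count_conflicts and the sample data are hypothetical; the sketch only covers part (i) of the summary, the number of groups in which each value appears in a conflict, and represents each group as a list of column values rather than a data frame.

```python
from collections import Counter


def count_conflicts(groups):
    """Count, per value, the number of groups in which it is part of a conflict.

    groups maps a group key to the list of values found in the column of
    interest for the rows of that group (a simplification for illustration).
    """
    counts = Counter()
    for values in groups.values():
        distinct = set(values)
        # A group is in conflict only if it holds more than one distinct value.
        if len(distinct) > 1:
            counts.update(distinct)
    return counts


groups = {
    "10001": ["NY", "NY", "N.Y."],   # conflict: two distinct values
    "02108": ["MA", "MA"],           # no conflict: a single distinct value
    "10002": ["NY", "New York"],     # conflict: two distinct values
}
conflict_counts = count_conflicts(groups)
```

Here "NY" is counted twice because it conflicts in two groups, while "MA" does not appear in the result at all.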

class openclean.data.groupby.ValueConflicts(count: int = 0, values: collections.Counter = <factory>)

Bases: object

Information about a value that occurs in conflicts: the number of groups in which the value appears as a conflicting value, and the values that it conflicts with.

count: int = 0
values: collections.Counter
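The <factory> default in the class signature is how Python renders a dataclass field with a default factory. Assuming the class is a dataclass along those lines, the hypothetical ValueConflictsSketch below shows the practical consequence: every instance gets its own Counter instead of sharing one mutable default.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ValueConflictsSketch:
    """Hypothetical mirror of the documented fields (not the library class)."""
    count: int = 0
    values: Counter = field(default_factory=Counter)


a = ValueConflictsSketch()
b = ValueConflictsSketch()

# Updating one instance leaves the other untouched.
a.count += 1
a.values["N.Y."] += 1
```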