openclean.data.groupby module
Base class for data frame groupings. A data frame grouping splits a data frame into multiple (potentially overlapping) data frames with the same schema.
- class openclean.data.groupby.ConflictSummary
Bases:
collections.defaultdict
Summarize conflicts in one or more attributes for groups in a data frame grouping.
- add(values: List[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]])
Add a list of conflicting values from a data frame group.
- Parameters
values (list of scalar or tuple) – List of conflicting values.
- most_common(n: Optional[int] = 10) → List[Tuple[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]], int]]
Ranking of the n most common values in conflicts.
- Parameters
n (int, default=10) – Number of values to include in the ranking.
- Return type
list of value and count pairs
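The following is a minimal, standard-library-only sketch of how such a summary could be organized. It is not the openclean implementation: the names `ConflictEntry` and `ConflictSummarySketch` are hypothetical, and the assumption that each value tracks a conflict count plus a counter of the values it conflicted with is inferred from the descriptions in this module.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass
class ConflictEntry:
    """Hypothetical per-value record; not an openclean class."""
    count: int = 0  # number of conflicts the value took part in
    others: Counter = field(default_factory=Counter)  # values it conflicted with


class ConflictSummarySketch(defaultdict):
    """Illustrative stand-in for ConflictSummary, keyed by value."""

    def __init__(self):
        super().__init__(ConflictEntry)

    def add(self, values):
        # Record one conflict: every value conflicts with all other values.
        for val in values:
            entry = self[val]
            entry.count += 1
            entry.others.update(v for v in values if v != val)

    def most_common(self, n=10):
        # Rank values by the number of conflicts they appeared in.
        ranking = Counter({val: entry.count for val, entry in self.items()})
        return ranking.most_common(n)


summary = ConflictSummarySketch()
summary.add(["NY", "New York"])
summary.add(["NY", "N.Y.", "New York"])
summary.add(["CA", "Calif."])
top = summary.most_common(2)
```

Here `top` holds the two values that took part in the most conflicts ("NY" and "New York", two conflicts each), and `summary["NY"].others` records which values "NY" conflicted with and how often.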
- class openclean.data.groupby.DataFrameGrouping(df: pandas.core.frame.DataFrame)
Bases:
object
A data frame grouping is a mapping of key values to subsets of rows for a given data frame.
Internally, this class contains a data frame and a mapping of key values to lists of row indices for the rows in each group. There are currently no restrictions on the number of groups that each of the original data frame rows can occur in.
The grouping provides a basic set of methods to access the individual data frames that represent the different groups.
- add(key: str, rows: List[int]) → openclean.data.groupby.DataFrameGrouping
Add a new group to the collection. Raises a ValueError if a group with the given key already exists. Returns a reference to this object instance.
- Parameters
key (scalar or tuple) – Key value generated by the GroupBy operator for the rows in the data frame.
rows (list(int)) – List of positions of the rows in the original data frame that are part of the added group. Note that these are not values from the data frame index but indexes into the array of rows, i.e., the position of each row in the data frame.
- Return type
openclean.data.groupby.DataFrameGrouping
- Raises
ValueError – If a group with the given key already exists.
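The semantics of add() and get() described above can be illustrated with a small pure-Python sketch. This is not the openclean class: `GroupingSketch` is a hypothetical stand-in that wraps a list of records instead of a pandas data frame, so that the example runs without any dependencies. With a real data frame, selecting rows by position would correspond to `df.iloc[rows]` rather than `df.loc[rows]`.

```python
class GroupingSketch:
    """Illustrative stand-in for DataFrameGrouping over a list of records."""

    def __init__(self, records):
        self.records = records  # stands in for the wrapped data frame
        self._groups = {}       # key -> list of row positions

    def add(self, key, rows):
        # Reject duplicate keys, as described for DataFrameGrouping.add().
        if key in self._groups:
            raise ValueError("group '{}' already exists".format(key))
        self._groups[key] = list(rows)
        return self  # returning self allows chained calls

    def get(self, key):
        # Unknown keys yield None instead of raising an error.
        if key not in self._groups:
            return None
        return [self.records[pos] for pos in self._groups[key]]


records = [("Alice", "NY"), ("Bob", "CA"), ("Carol", "NY")]

# Row 0 is part of both groups: groups may overlap.
grouping = GroupingSketch(records).add("NY", [0, 2]).add("A-B", [0, 1])
ny_rows = grouping.get("NY")
missing = grouping.get("TX")

try:
    grouping.add("NY", [1])
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```
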
- property columns: List[str]
Get the names of columns in the schema of the grouped data frame.
- Return type
list of string
- get(key: str) → pandas.core.frame.DataFrame
Get the data frame that is associated with the given key. Returns None if the given key does not exist in the grouping.
- Parameters
key (scalar or tuple) – Key value generated by the GroupBy operator for the rows in the data frame.
- Return type
pd.DataFrame
- groups() → Iterator[Tuple[str, pandas.core.frame.DataFrame]]
Synonym for items(). Iterates over the groups (and their associated keys) in this grouping.
- Return type
(scalar or tuple, pd.DataFrame)
- items() → Iterator[Tuple[str, pandas.core.frame.DataFrame]]
Iterate over the groups in this grouping. Returns pairs of group key and the associated data frame containing the rows from the original data frame that are in this group.
- Return type
(scalar or tuple, pd.DataFrame)
- keys() → Set[str]
Get the set of group keys.
- Return type
set
- rows(key: str) → List[int]
Get the row indices associated with the given key. Returns None if the key doesn’t exist.
- Parameters
key (scalar or tuple) – Key value generated by the GroupBy operator for the rows in the data frame.
- Return type
list
- values(key: str, columns: Union[int, str, List[Union[str, int]]]) → collections.Counter
Get values (and their frequency counts) for columns of rows in the group that is identified by the given key.
- Parameters
key (scalar or tuple) – Key value generated by the GroupBy operator for the rows in the data frame.
columns (int, string, or list(int or string)) – Single column or list of column index positions or column names.
- Return type
collections.Counter
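As a rough illustration of the kind of result values() returns, the sketch below counts values for one or more columns of the rows in a single group. It is a simplification under stated assumptions: `group_values` is a hypothetical helper, records are plain tuples rather than data frame rows, only integer column positions are supported, and the choice of scalars for a single column versus tuples for a list of columns is an assumption based on the "scalar or tuple" wording used elsewhere in this module.

```python
from collections import Counter


def group_values(records, positions, columns):
    """Count values in the given column position(s) for the rows of one group."""
    if isinstance(columns, int):
        # Single column: count scalar values.
        return Counter(records[pos][columns] for pos in positions)
    # Multiple columns: count tuples of values.
    return Counter(
        tuple(records[pos][col] for col in columns) for pos in positions
    )


records = [
    ("Alice", "NY", "10001"),
    ("Bob", "NY", "10001"),
    ("Carol", "New York", "10001"),
]
positions = [0, 1, 2]  # row positions that make up the group

states = group_values(records, positions, 1)
pairs = group_values(records, positions, [1, 2])
```

In this example `states` maps "NY" to 2 and "New York" to 1, while `pairs` uses (state, zip) tuples as keys.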
- class openclean.data.groupby.DataFrameViolation(df: pandas.core.frame.DataFrame)
Bases:
openclean.data.groupby.DataFrameGrouping
Subclass of DataFrameGrouping that maintains additional metadata related to a violation for each group.
- add(key: str, rows: List[int], meta: Optional[collections.Counter] = None) → openclean.data.groupby.DataFrameViolation
Add a group to the collection. Stores the metadata counter in self._meta and the row indices in self._groups, both under the given key.
- Parameters
key (str) – Key for the group.
rows (list) – List of row indices for the group.
meta (Counter, optional) – Metadata counters for the group.
- Return type
openclean.data.groupby.DataFrameViolation
- conflicts(key: str, columns: Union[int, str, List[Union[str, int]]]) → collections.Counter
Synonym for values(). Get values (and their frequency counts) for columns of rows in the group that is identified by the given key.
- Parameters
key (scalar or tuple) – Key value generated by the GroupBy operator for the rows in the data frame.
columns (int, string, or list(int or string)) – Single column or list of column index positions or column names.
- Return type
collections.Counter
- get_meta(key: str) → collections.Counter
Returns the metadata counter for the given key.
- Parameters
key (str) – Key for the data frame group.
- Return type
collections.Counter
- summarize_conflicts(columns: Union[int, str, List[Union[str, int]]]) → openclean.data.groupby.ConflictSummary
Get a summary of conflicting values in one or more attributes within the individual groups in the grouping.
A conflict is defined as a set of multiple values that occur in the specified column(s) within a group in this grouping. For each value that occurs in a conflict the summary maintains (i) the number of groups where the value appeared in a conflict, and (ii) a list of conflicting values with a count for the number of groups that these values conflicted in.
- Parameters
columns (int, string, or list(int or string)) – Single column or list of column index positions or column names.
- Return type
openclean.data.groupby.ConflictSummary
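The definition of a conflict given above (more than one distinct value within a group) can be sketched without openclean or pandas. The helper `count_conflicts` and the sample data are hypothetical; the sketch covers only part (i) of the summary, the number of groups in which each value appeared in a conflict.

```python
from collections import Counter


def count_conflicts(values_by_group):
    """Count, per value, the number of groups in which it was part of a conflict.

    values_by_group maps a group key to the list of values found in the
    column(s) of interest for the rows in that group.
    """
    in_conflict = Counter()
    for values in values_by_group.values():
        distinct = set(values)
        # A group is in conflict only if it has more than one distinct value.
        if len(distinct) > 1:
            in_conflict.update(distinct)
    return in_conflict


cities_by_zip = {
    "10001": ["New York", "NYC", "New York"],
    "10002": ["New York", "Manhattan"],
    "60601": ["Chicago", "Chicago"],  # one distinct value: no conflict
    "94105": ["San Francisco", "SF"],
}
result = count_conflicts(cities_by_zip)
```

Groups with a single distinct value, such as "60601" here, contribute nothing to the summary, while "New York" is counted once for each of the two groups in which it conflicts with another value.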