openclean.profiling.anomalies.frequency module
Operators for frequency outlier detection.
- class openclean.profiling.anomalies.frequency.FrequencyOutlierResults(iterable=(), /)
Bases: list
Frequency outlier results are a list of dictionaries. Each dictionary contains information about a detected outlier value (‘value’) and additional frequency metadata (‘metadata’: {‘count’, ‘frequency’}).
This class provides some basic functionality to access the individual pieces of information from these dictionaries.
- add(value: Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]], count: int, frequency: Optional[float] = None)
Add a new outlier to the list.
- Parameters
value (scalar or tuple) – The outlier value.
count (int) – Value frequency count.
frequency (float, default=None) – Normalized value frequency (if a normalizer was used).
- counts() → collections.Counter
Get a mapping of outlier values to their frequency counts.
- Return type
collections.Counter
- frequencies() → Dict
Get a mapping of outlier values to their normalized frequencies.
- Return type
dict
- Raises
KeyError –
- values() → List
Get only the list of outlier values.
- Return type
list
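The dictionary layout described above can be sketched with the standard library alone. This is an illustration, not the openclean implementation: the class name `OutlierListSketch` is hypothetical, and the behavior that `frequencies()` raises a KeyError when an entry was added without a normalized frequency is an assumption inferred from the documented Raises entry.

```python
from collections import Counter


class OutlierListSketch(list):
    """Hypothetical stand-in that mirrors the documented dictionary layout."""

    def add(self, value, count, frequency=None):
        # Assumption: 'frequency' is stored only when a normalized value is given.
        metadata = {"count": count}
        if frequency is not None:
            metadata["frequency"] = frequency
        self.append({"value": value, "metadata": metadata})

    def counts(self):
        # Map each outlier value to its absolute frequency count.
        return Counter({o["value"]: o["metadata"]["count"] for o in self})

    def frequencies(self):
        # Raises KeyError if an entry was added without a frequency.
        return {o["value"]: o["metadata"]["frequency"] for o in self}

    def values(self):
        # Outlier values only, in insertion order.
        return [o["value"] for o in self]


results = OutlierListSketch()
results.add("BK", count=1, frequency=0.1)
results.add(("NY", 10001), count=2, frequency=0.2)
```

Because the sketch subclasses list, each entry remains a plain dictionary of the form {'value': ..., 'metadata': {'count': ..., 'frequency': ...}}, and the accessor methods only project individual fields out of those dictionaries.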
- class openclean.profiling.anomalies.frequency.FrequencyOutliers(threshold: typing.Union[typing.Callable, int, float], normalize: typing.Optional[openclean.function.value.normalize.numeric.NumericNormalizer] = <openclean.function.value.normalize.numeric.DivideByTotal object>)
Bases: openclean.profiling.anomalies.base.AnomalyDetector
Detect frequency outliers for values in a given list. A value is considered an outlier if its relative frequency in the list satisfies the given threshold predicate.
- process(values: collections.Counter) → openclean.profiling.anomalies.frequency.FrequencyOutlierResults
Normalize the frequency counts in the given mapping. Returns all values that satisfy the threshold constraint together with their normalized (and absolute) frequencies.
- Parameters
values (dict) – Set of distinct scalar values or tuples of scalar values that are mapped to their respective frequency count.
- Return type
openclean.profiling.anomalies.frequency.FrequencyOutlierResults
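The normalize-then-filter step described for process() can be illustrated without the library. The sketch below assumes divide-by-total normalization, as the default in the signature suggests, and a callable threshold; the function name `frequency_outliers_sketch` and the tuple-based return value are hypothetical simplifications, not the openclean API.

```python
from collections import Counter


def frequency_outliers_sketch(counts, threshold):
    """Return (value, count, relative frequency) for every value whose
    relative frequency satisfies the threshold predicate."""
    total = sum(counts.values())
    outliers = []
    for value, count in counts.items():
        freq = count / total  # assumed divide-by-total normalization
        if threshold(freq):
            outliers.append((value, count, freq))
    return outliers


# "A" occurs 8 times, "B" and "C" once each (10 values in total).
counts = Counter(["A"] * 8 + ["B", "C"])

# Flag values that make up at most 10% of the list.
rare = frequency_outliers_sketch(counts, threshold=lambda f: f <= 0.1)
```

Note that the predicate decides what counts as an outlier: passing `lambda f: f >= 0.5` instead would flag the dominant value "A" rather than the rare ones.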
- openclean.profiling.anomalies.frequency.frequency_outliers(df: pandas.core.frame.DataFrame, columns: typing.Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction, typing.List[typing.Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction]]], threshold: typing.Union[typing.Callable, int, float], normalize: typing.Optional[openclean.function.value.normalize.numeric.NumericNormalizer] = <openclean.function.value.normalize.numeric.DivideByTotal object>) → openclean.profiling.anomalies.frequency.FrequencyOutlierResults
Detect frequency outliers for values (or value combinations) in one or more columns of a data frame. A value (combination) is considered an outlier if the relative frequency satisfies the given threshold predicate.
- Parameters
df (pandas.DataFrame) – Input data frame.
columns (list, tuple, or openclean.function.eval.base.EvalFunction) – Evaluation function to extract values from data frame rows. This can also be a list or tuple of evaluation functions or a list of column names or index positions.
threshold (callable) – Function that accepts a float (i.e., the relative frequency) and that returns a Boolean value. True indicates that the value (frequency) satisfies the value outlier condition.
normalize (openclean.function.value.normalize.NumericNormalizer) – Function used to normalize frequency values before evaluating the threshold constraint.
- Return type
openclean.profiling.anomalies.frequency.FrequencyOutlierResults
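To make the idea of value combinations across several columns concrete, the following standard-library sketch counts tuples of column values and applies a threshold to their relative frequencies. It uses a list of dictionaries as a stand-in for a data frame; the helper name `column_combination_outliers`, the sample rows, and the dictionary return type are all hypothetical and do not reflect the openclean implementation.

```python
from collections import Counter

# Stand-in for a data frame: one dictionary per row (made-up sample data).
rows = [
    {"city": "NY", "state": "NY"},
    {"city": "NY", "state": "NY"},
    {"city": "NY", "state": "NY"},
    {"city": "NY", "state": "NJ"},
]


def column_combination_outliers(rows, columns, threshold):
    """Map each outlier value combination to its relative frequency."""
    # Count each combination of values in the selected columns.
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    total = sum(counts.values())
    return {
        combo: count / total
        for combo, count in counts.items()
        if threshold(count / total)
    }


# Flag combinations that appear in fewer than half of the rows.
suspicious = column_combination_outliers(
    rows, ["city", "state"], threshold=lambda f: f < 0.5
)
```

Here the combination ("NY", "NJ") appears in one of four rows, so its relative frequency of 0.25 satisfies the predicate, while ("NY", "NY") at 0.75 does not.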