openclean.profiling.anomalies.frequency module

Operators for frequency outlier detection.

class openclean.profiling.anomalies.frequency.FrequencyOutlierResults(iterable=(), /)

Bases: list

Frequency outlier results are a list of dictionaries. Each dictionary contains information about a detected outlier value (‘value’) and additional frequency metadata (‘metadata’: {‘count’, ‘frequency’}).

This class provides some basic functionality to access the individual pieces of information from these dictionaries.

add(value: Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]], count: int, frequency: Optional[float] = None)

Add a new outlier to the list.

Parameters
  • value (scalar or tuple) – The outlier value.

  • count (int) – Value frequency count.

  • frequency (float, default=None) – Normalized value frequency (if a normalizer was used).

counts() collections.Counter

Get a mapping of outlier values to their frequency counts.

Return type

collections.Counter

frequencies() Dict

Get a mapping of outlier values to their normalized frequencies.

Return type

dict

Raises

KeyError

values() List

Get only the list of outlier values.

Return type

list
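
The dictionary layout and the accessors described above can be illustrated with a minimal stand-in. This is a sketch, not the library's implementation: the class name OutlierResultsSketch is hypothetical, and it assumes that the 'frequency' key is stored only when a frequency is passed to add(), which would explain why frequencies() may raise a KeyError.

```python
from collections import Counter


class OutlierResultsSketch(list):
    """Hypothetical stand-in for FrequencyOutlierResults (not the library class)."""

    def add(self, value, count, frequency=None):
        # Assumption: the 'frequency' key is only stored when a value is given.
        metadata = {"count": count}
        if frequency is not None:
            metadata["frequency"] = frequency
        self.append({"value": value, "metadata": metadata})

    def counts(self):
        # Map each outlier value to its absolute frequency count.
        return Counter({o["value"]: o["metadata"]["count"] for o in self})

    def frequencies(self):
        # Raises KeyError if an entry was added without a frequency.
        return {o["value"]: o["metadata"]["frequency"] for o in self}

    def values(self):
        # Return only the outlier values, in insertion order.
        return [o["value"] for o in self]


results = OutlierResultsSketch()
results.add("NY", count=1, frequency=0.1)
results.add(("Brooklyn", "NY"), count=2, frequency=0.2)
```

Because the results object is itself a list, it can also be iterated or indexed directly when the raw dictionaries are needed.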

class openclean.profiling.anomalies.frequency.FrequencyOutliers(threshold: typing.Union[typing.Callable, int, float], normalize: typing.Optional[openclean.function.value.normalize.numeric.NumericNormalizer] = <openclean.function.value.normalize.numeric.DivideByTotal object>)

Bases: openclean.profiling.anomalies.base.AnomalyDetector

Detect frequency outliers for values in a given list. A value is considered an outlier if its relative frequency in the list satisfies the given threshold predicate.

process(values: collections.Counter) openclean.profiling.anomalies.frequency.FrequencyOutlierResults

Normalize the frequency counts in the given mapping. Returns all values that satisfy the threshold constraint together with their normalized (and absolute) frequencies.

Parameters

values (dict) – Set of distinct scalar values or tuples of scalar values that are mapped to their respective frequency count.

Return type

openclean.profiling.anomalies.frequency.FrequencyOutlierResults
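
The normalize-then-filter step described for process() can be sketched in plain Python. The helper names divide_by_total and find_frequency_outliers are hypothetical, and the normalizer is assumed to divide each count by the sum of all counts, as the default name DivideByTotal suggests. The threshold is assumed to be a callable that receives the normalized frequency.

```python
from collections import Counter


def divide_by_total(counts):
    # Assumed behavior of the default normalizer: count / sum of all counts.
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def find_frequency_outliers(counts, threshold):
    # Keep every value whose normalized frequency satisfies the predicate,
    # together with its absolute count and normalized frequency.
    frequencies = divide_by_total(counts)
    outliers = []
    for value, count in counts.items():
        freq = frequencies[value]
        if threshold(freq):
            outliers.append(
                {"value": value, "metadata": {"count": count, "frequency": freq}}
            )
    return outliers


counts = Counter({"A": 8, "B": 1, "C": 1})
# Flag values that make up at most 10% of the list.
outliers = find_frequency_outliers(counts, threshold=lambda f: f <= 0.1)
```

Here "B" and "C" each account for one tenth of the ten observations and are reported, while "A" is not.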

openclean.profiling.anomalies.frequency.frequency_outliers(df: pandas.core.frame.DataFrame, columns: typing.Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction, typing.List[typing.Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction]]], threshold: typing.Union[typing.Callable, int, float], normalize: typing.Optional[openclean.function.value.normalize.numeric.NumericNormalizer] = <openclean.function.value.normalize.numeric.DivideByTotal object>) openclean.profiling.anomalies.frequency.FrequencyOutlierResults

Detect frequency outliers for values (or value combinations) in one or more columns of a data frame. A value (combination) is considered an outlier if the relative frequency satisfies the given threshold predicate.

Parameters
  • df (pandas.DataFrame) – Input data frame.

  • columns (list, tuple, or openclean.function.eval.base.EvalFunction) – Evaluation function to extract values from data frame rows. This can also be a list or tuple of evaluation functions or a list of column names or index positions.

  • threshold (callable) – Function that accepts a float (i.e., the relative frequency) and that returns a Boolean value. True indicates that the value (frequency) satisfies the value outlier condition.

  • normalize (openclean.function.value.normalize.NumericNormalizer) – Function used to normalize frequency values before evaluating the threshold constraint.

Return type

openclean.profiling.anomalies.frequency.FrequencyOutlierResults
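
How value combinations from several columns are counted and filtered can be sketched without any dependencies. In this illustration a list of dictionaries stands in for the data frame, the helper name combination_outliers is hypothetical, and every combination is represented as a tuple, even for a single column. It does not call the library function itself.

```python
from collections import Counter

# Stand-in for a data frame: one dictionary per row.
rows = [
    {"city": "Brooklyn", "state": "NY"},
    {"city": "Brooklyn", "state": "NY"},
    {"city": "Brooklyn", "state": "NY"},
    {"city": "Brooklyn", "state": "NY"},
    {"city": "Brooklyn", "state": "NJ"},
]


def combination_outliers(rows, columns, threshold):
    # Count each combination of values across the selected columns.
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    total = sum(counts.values())
    # Report combinations whose relative frequency satisfies the predicate.
    return [
        {"value": value, "metadata": {"count": count, "frequency": count / total}}
        for value, count in counts.items()
        if threshold(count / total)
    ]


# Flag combinations that occur in fewer than half of the rows.
rare = combination_outliers(rows, ["city", "state"], threshold=lambda f: f < 0.5)
```

The pair ("Brooklyn", "NJ") occurs in one of five rows and is the only combination reported under this threshold.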