openclean.profiling.anomalies.domain module

Domain outlier detector.

class openclean.profiling.anomalies.domain.DomainOutliers(domain: Union[pandas.core.frame.DataFrame, Dict, List, Set], ignore_case: Optional[bool] = False)

Bases: openclean.profiling.anomalies.conditional.ConditionalOutliers

The domain outlier detector returns the list of values from a given data stream that do not occur in a ground truth domain.

outlier(value: Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]) bool

Test if a given value is in the associated ground truth domain. If the value is not in the domain it is considered an outlier.

Returns True if the value is classified as an outlier (it does not occur in the domain) and False otherwise.

Parameters

value (scalar or tuple) – Value that is being tested for the outlier condition.

Return type

bool
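
The outlier condition described above can be illustrated with a minimal stand-alone sketch in plain Python. It does not use openclean. The helper names normalize and is_domain_outlier are hypothetical, and lower-casing strings is only an assumed way of handling ignore_case, not necessarily what the library does.

```python
def normalize(value, ignore_case=False):
    """Lower-case strings (including strings inside tuples) if requested.

    Assumption: case-insensitive matching is done by lower-casing both the
    tested value and the domain values.
    """
    if not ignore_case:
        return value
    if isinstance(value, str):
        return value.lower()
    if isinstance(value, tuple):
        return tuple(normalize(v, ignore_case=True) for v in value)
    # Numbers, dates, and other non-string values are left unchanged.
    return value


def is_domain_outlier(value, domain, ignore_case=False):
    """Return True if the value does not occur in the ground truth domain."""
    known = {normalize(v, ignore_case) for v in domain}
    return normalize(value, ignore_case) not in known


cities = ["Boston", "Chicago", "Seattle"]
print(is_domain_outlier("Chicago", cities))                    # in the domain
print(is_domain_outlier("chicago", cities))                    # case differs
print(is_domain_outlier("chicago", cities, ignore_case=True))  # case ignored
```

Building a set from the domain keeps each membership test cheap, but it requires the domain values to be hashable, which holds for the scalar and tuple values named in the signature.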

openclean.profiling.anomalies.domain.domain_outliers(df: pandas.core.frame.DataFrame, columns: Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction, List[Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction]]], domain: Union[pandas.core.frame.DataFrame, Dict, List, Set], ignore_case: Optional[bool] = False) List

Returns the list of values in the given data frame column(s) that do not occur in a ground truth domain.

Parameters
  • df (pandas.DataFrame) – Input data frame.

  • columns (int, str, list, tuple, or openclean.function.eval.base.EvalFunction) – Column name, index position, or evaluation function used to extract values from data frame rows. This can also be a list or tuple of evaluation functions, column names, or index positions.

  • domain (pandas.DataFrame, pandas.Series, or object) – Data frame or series, or any object that implements the __contains__ method.

  • ignore_case (bool, optional) – If True, ignore case when checking whether a value is in the domain. Defaults to False.

Return type

list
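
To make the list result concrete, here is a hedged plain-Python sketch of the same idea applied to a sequence of column values. It is not the openclean implementation. The function name find_domain_outliers is hypothetical, and reporting each distinct outlier once in first-seen order is an assumption, since the ordering and duplicate handling of the real function are not stated here.

```python
def find_domain_outliers(values, domain, ignore_case=False):
    """Return the distinct values that do not occur in the domain.

    Assumption: each outlier is reported once, in order of first appearance.
    """
    def key(v):
        # Compare strings case-insensitively only when requested.
        return v.lower() if ignore_case and isinstance(v, str) else v

    known = {key(v) for v in domain}
    seen = set()
    outliers = []
    for v in values:
        if key(v) not in known and v not in seen:
            seen.add(v)
            outliers.append(v)
    return outliers


states = ["NY", "CA", "ny", "N.Y.", "TX", "N.Y."]
domain = {"NY", "CA", "TX"}

print(find_domain_outliers(states, domain))
# ['ny', 'N.Y.']
print(find_domain_outliers(states, domain, ignore_case=True))
# ['N.Y.']
```

With a real data frame, the values would be those extracted from the selected column(s), and a multi-column selection would be tested as tuples against a domain of tuples.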