openclean.profiling.anomalies.domain module
Domain outlier detector.
- class openclean.profiling.anomalies.domain.DomainOutliers(domain: Union[pandas.core.frame.DataFrame, Dict, List, Set], ignore_case: Optional[bool] = False)
Bases: openclean.profiling.anomalies.conditional.ConditionalOutliers
The domain outlier detector returns the list of values from a given data stream that do not occur in a ground truth domain.
- outlier(value: Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]) bool
Test whether a given value is in the associated ground truth domain. If the value is not in the domain, it is considered an outlier.
Returns True if the tested value is classified as an outlier (it does not occur in the domain) and False otherwise.
- Parameters
value (scalar or tuple) – Value that is being tested for the outlier condition.
- Return type
bool
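The membership test described above can be sketched in plain Python. This is a minimal illustration of the semantics only, not the library's implementation: the helper name is_domain_outlier is hypothetical, and the way ignore_case is handled here (lowercasing strings on both sides) is an assumption.

```python
def is_domain_outlier(value, domain, ignore_case=False):
    """Hypothetical helper: True if value does not occur in domain."""
    if ignore_case:
        # Assumption: case is ignored by lowercasing strings on both sides.
        def fold(v):
            return v.lower() if isinstance(v, str) else v
        return fold(value) not in {fold(v) for v in domain}
    # Any object that implements __contains__ works as a domain here.
    return value not in domain


states = {"NY", "NJ", "CT"}
case_sensitive = is_domain_outlier("ny", states)
case_insensitive = is_domain_outlier("ny", states, ignore_case=True)
unknown = is_domain_outlier("TX", states)
```

With a case-sensitive check, "ny" counts as an outlier because only "NY" is in the domain; with ignore_case=True it does not.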
- openclean.profiling.anomalies.domain.domain_outliers(df: pandas.core.frame.DataFrame, columns: Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction, List[Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction]]], domain: Union[pandas.core.frame.DataFrame, Dict, List, Set], ignore_case: Optional[bool] = False) List
The domain outlier detector returns the list of values from a given data stream that do not occur in a ground truth domain.
- Parameters
df (pandas.DataFrame) – Input data frame.
columns (int, str, list, tuple, or openclean.function.eval.base.EvalFunction) – Column name, index position, or evaluation function used to extract values from data frame rows. This can also be a list or tuple of evaluation functions, column names, or index positions.
domain (pandas.DataFrame, pandas.Series, or object) – Data frame or series, or any object that implements the __contains__ method (for example a dict, list, or set).
ignore_case (bool, optional) – Ignore case when checking whether a value is in the domain.
- Return type
list
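The behavior of this function can be approximated without the library as follows. This is a hedged sketch, not openclean's code: find_domain_outliers and normalize are hypothetical names, rows are modeled as plain dictionaries instead of a pandas.DataFrame, and reporting each distinct outlier once in first-seen order is an assumption about the result.

```python
def normalize(value, ignore_case):
    """Lowercase strings (also inside tuples) when ignore_case is set."""
    if not ignore_case:
        return value
    if isinstance(value, str):
        return value.lower()
    if isinstance(value, tuple):
        return tuple(normalize(v, ignore_case) for v in value)
    return value


def find_domain_outliers(rows, columns, domain, ignore_case=False):
    """Hypothetical stand-in for domain_outliers over a list of dict rows."""
    known = {normalize(v, ignore_case) for v in domain}
    outliers = []
    seen = set()
    for row in rows:
        # A single column yields a scalar; several columns yield a tuple.
        if isinstance(columns, (list, tuple)):
            value = tuple(row[c] for c in columns)
        else:
            value = row[columns]
        # Assumption: each distinct outlier is reported once.
        if normalize(value, ignore_case) not in known and value not in seen:
            seen.add(value)
            outliers.append(value)
    return outliers


rows = [
    {"city": "Berlin", "country": "DE"},
    {"city": "paris", "country": "FR"},
    {"city": "Gotham", "country": "US"},
    {"city": "Gotham", "country": "US"},
]
single = find_domain_outliers(rows, "city", ["Berlin", "Paris"], ignore_case=True)
pairs = find_domain_outliers(
    rows, ["city", "country"], [("Berlin", "DE"), ("Paris", "FR")]
)
```

When several columns are given, each row is tested as a tuple, so the domain must contain tuples of the same width.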