openclean.profiling.anomalies.pattern module

Outlier detection algorithms using regular expressions. Pattern outliers are values that do not match the pattern (or list of patterns) that the values in a list (e.g., a data frame column) are expected to satisfy.

openclean.profiling.anomalies.pattern.DefaultTokenizer() → openclean.function.token.base.Tokenizer

Create an instance of the default tokenizer.

class openclean.profiling.anomalies.pattern.RegExOutliers(patterns: List[str], fullmatch: Optional[bool] = True)

Bases: openclean.profiling.anomalies.conditional.ConditionalOutliers

Identify values in a (list of) data frame column(s) that do not match any of the given pattern expressions. Patterns are represented as strings in the Python Regular Expression Syntax.
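The matching rule described above can be sketched with the standard `re` module. This is a minimal illustration, not the library's implementation: the helper name `is_pattern_outlier` is hypothetical, the sketch handles string values only, and it assumes that `fullmatch=False` corresponds to prefix matching with `re.match`, which may differ from what openclean does internally.

```python
import re
from typing import List


def is_pattern_outlier(value: str, patterns: List[str], fullmatch: bool = True) -> bool:
    """Return True if the value matches none of the given patterns."""
    # Assumption: when fullmatch is False, a match at the start of the
    # string (re.match) is sufficient. The library may behave differently.
    matcher = re.fullmatch if fullmatch else re.match
    for pattern in patterns:
        if matcher(pattern, value) is not None:
            # One matching pattern is enough for the value to be accepted.
            return False
    return True


# Hypothetical example: five-digit or ZIP+4 postal codes.
zip_patterns = [r"\d{5}", r"\d{5}-\d{4}"]
values = ["10003", "10003-1234", "1000", "10003x"]

strict = [v for v in values if is_pattern_outlier(v, zip_patterns)]
relaxed = [v for v in values if is_pattern_outlier(v, zip_patterns, fullmatch=False)]
```

With full matching, `"10003x"` is reported because of the trailing character. Under the prefix-matching assumption it is accepted, since its first five characters satisfy the first pattern.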

outlier(value: Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]) → bool

Test whether a given value matches the associated regular expressions. A value that matches none of them is considered an outlier.

Returns True if the value is classified as an outlier and False otherwise.

Parameters

value (scalar or tuple) – Value that is being tested for the outlier condition.

Return type

bool

class openclean.profiling.anomalies.pattern.TokenSignatureOutliers(signature: List[Set[str]], tokenizer: Optional[openclean.function.token.base.Tokenizer] = None, exactly_one: Optional[bool] = False)

Bases: openclean.profiling.anomalies.conditional.ConditionalOutliers

Identify values that do not contain at least one token from a given token signature.

Uses a given tokenizer to transform a given value into a set of tokens. Then checks if at least one of the tokens matches one of the entries in a token signature. To match an entry, the token has to be a member of the set of tokens for that entry.
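The membership test described above can be sketched in plain Python. This is an illustration under stated assumptions rather than the library's code: `whitespace_tokens` is a hypothetical stand-in for a `Tokenizer`, `is_signature_outlier` is a hypothetical helper name, and the `exactly_one` option is not modeled.

```python
from typing import Callable, List, Set


def whitespace_tokens(value: str) -> Set[str]:
    """Hypothetical tokenizer: lowercase the value and split on whitespace."""
    return set(value.lower().split())


def is_signature_outlier(
    value: str,
    signature: List[Set[str]],
    tokenize: Callable[[str], Set[str]] = whitespace_tokens,
) -> bool:
    """Return True if no token of the value is a member of any signature entry."""
    tokens = tokenize(value)
    # A value is accepted as soon as one token appears in one entry.
    return not any(token in entry for entry in signature for token in tokens)


# Hypothetical signature: each entry groups the spellings of one street type.
signature = [{"st", "street"}, {"ave", "avenue"}]
addresses = ["5th Ave", "Main Street", "Broadway"]

outliers = [a for a in addresses if is_signature_outlier(a, signature)]
```

Here `"Broadway"` is the only value flagged, because none of its tokens occurs in either signature entry.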

outlier(value: Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]) → bool

Test whether a given value contains at least one token that matches an entry in the associated token signature. A value without such a token is considered an outlier.

Returns True if the value is classified as an outlier and False otherwise.

Parameters

value (scalar or tuple) – Value that is being tested for the outlier condition.

Return type

bool

openclean.profiling.anomalies.pattern.regex_outliers(df: pandas.core.frame.DataFrame, columns: Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction, List[Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction]]], patterns: List[str], fullmatch: Optional[bool] = True) → List

Identify values in a (list of) data frame column(s) that do not match any of the given pattern expressions. Patterns are represented as strings in the Python Regular Expression Syntax.

Parameters
  • df (pandas.DataFrame) – Input data frame.

  • columns (list, tuple, or openclean.function.eval.base.EvalFunction) – Evaluation function to extract values from data frame rows. This can also be a list or tuple of evaluation functions or a list of column names or index positions.

  • patterns (list(string)) – List of regular expression patterns.

  • fullmatch (bool, default=True) – If True, the pattern has to match a given string fully in order to not be considered an outlier.

Return type

list