openclean.profiling.anomalies.datatype module

Operators for detecting values in a column that do not match the (expected) data type for the column.

class openclean.profiling.anomalies.datatype.DatatypeOutlierResults(iterable=(), /)

Bases: list

Datatype outlier results are a list of dictionaries. Each dictionary contains a detected outlier value (‘value’) and additional metadata (‘metadata’: {‘type’}) giving the type label that was assigned to that value.

This class provides some basic functionality to access the individual pieces of information from these dictionaries.

types() Dict

Get a mapping of outlier types to a list of values of that type.

Return type

dict

Raises

KeyError

values() List

Get only the list of outlier values.

Return type

list
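The result structure described above can be illustrated with a minimal sketch. The class `OutlierResultsSketch` is a hypothetical stand-in written for this example, not the library class; it only assumes the documented dictionary layout (‘value’ plus ‘metadata’ with key ‘type’). Raising KeyError for entries that lack these keys is likewise an assumption about when the documented KeyError occurs.

```python
from collections import defaultdict


class OutlierResultsSketch(list):
    """Hypothetical stand-in for a list of outlier dictionaries."""

    def types(self):
        """Map each type label to the list of outlier values with that label."""
        groups = defaultdict(list)
        for entry in self:
            # Assumption: a missing 'metadata' or 'type' key raises KeyError.
            groups[entry['metadata']['type']].append(entry['value'])
        return dict(groups)

    def values(self):
        """Return only the outlier values, in list order."""
        return [entry['value'] for entry in self]


results = OutlierResultsSketch([
    {'value': 'n/a', 'metadata': {'type': 'str'}},
    {'value': '3.5', 'metadata': {'type': 'float'}},
    {'value': 'unknown', 'metadata': {'type': 'str'}},
])

by_type = results.types()
only_values = results.values()
```

Here `by_type` groups the values by their assigned label, while `only_values` keeps the original order of the outliers.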

class openclean.profiling.anomalies.datatype.DatatypeOutliers(classifier: Callable, domain: Union[int, float, str, datetime.datetime, List[Union[int, float, str, datetime.datetime]]])

Bases: openclean.profiling.anomalies.conditional.ConditionalOutliers

Identify values that do not match the expected data type for a list of values (e.g., a column in a data frame). The expected data type is defined by a set of data type labels. A classifier is used to identify the type of each value. Values that are assigned a type that is not included in the set of expected type labels are considered outliers.

outlier(value: Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]) Dict

Use the classifier to get the data type for the given value. If the returned type label is not included in the set of valid type labels, the value is considered an outlier.

For values that are classified as outliers, returns a dictionary with two elements: ‘value’, the tested value, and ‘metadata’, a dictionary that holds the returned type label under the key ‘type’.

Parameters

value (scalar or tuple) – Value that is being tested for the outlier condition.

Return type

dict
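The test described for `outlier` can be sketched in plain Python. Both `toy_classifier` and `outlier_sketch` are hypothetical helpers written for this example, not part of openclean. Returning None for values that are not outliers is an assumption, since the text above only describes the result for outliers.

```python
def toy_classifier(value):
    """Hypothetical classifier: label a string as 'int' or 'str'."""
    try:
        int(value)
        return 'int'
    except (TypeError, ValueError):
        return 'str'


def outlier_sketch(value, classifier, domain):
    """Return an outlier dictionary if the value's label is outside the domain."""
    # The domain may be a single label or a list of labels.
    valid = set(domain) if isinstance(domain, (list, set, tuple)) else {domain}
    label = classifier(value)
    if label not in valid:
        return {'value': value, 'metadata': {'type': label}}
    # Assumption: non-outliers yield None.
    return None


accepted = outlier_sketch('42', toy_classifier, 'int')
rejected = outlier_sketch('abc', toy_classifier, 'int')
```

In this sketch `accepted` is None because '42' is labeled 'int', and `rejected` carries the value 'abc' together with its label 'str'.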

openclean.profiling.anomalies.datatype.datatype_outliers(df: pandas.core.frame.DataFrame, columns: Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction, List[Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction]]], classifier: Callable, domain: Union[int, float, str, datetime.datetime, List[Union[int, float, str, datetime.datetime]]]) Dict

Identify values that do not match the expected data type. The expected data type for a (list of) column(s) is defined by the given domain. The classifier is used to identify the type of data values. Values that are assigned a type that does not belong to the defined domain are considered data type outliers.

Parameters
  • df (pandas.DataFrame) – Input data frame.

  • columns (int, string, list, or openclean.function.eval.base.EvalFunction) – Evaluation function to extract values from data frame rows. This can also be a single column reference or a list of column references.

  • classifier (callable) – Classifier that assigns data type class labels to column values.

  • domain (scalar or list) – Valid data type value(s). Defines the types that are not considered outliers.

Return type

dict
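To show how the classifier and domain parameters interact, here is a minimal sketch that uses a plain list in place of a data frame column. The names `label_type` and `find_type_outliers` are hypothetical and do not exist in openclean, and the type labels are chosen for this example only. The layout of the dictionary returned by `datatype_outliers` is not specified above, so the sketch simply collects the per-value dictionaries documented for `DatatypeOutliers.outlier` in a list.

```python
from datetime import datetime


def label_type(value):
    """Hypothetical classifier: return 'int', 'float', 'date', or 'str'."""
    text = str(value).strip()
    for cast, label in ((int, 'int'), (float, 'float')):
        try:
            cast(text)
            return label
        except ValueError:
            continue
    try:
        datetime.strptime(text, '%Y-%m-%d')
        return 'date'
    except ValueError:
        return 'str'


def find_type_outliers(values, classifier, domain):
    """Collect an outlier dictionary for every value labeled outside the domain."""
    valid = set(domain) if isinstance(domain, (list, set, tuple)) else {domain}
    found = []
    for value in values:
        label = classifier(value)
        if label not in valid:
            found.append({'value': value, 'metadata': {'type': label}})
    return found


# Stand-in for the values of one column; numeric labels form the domain.
column_values = ['12', '7.5', '2021-03-04', 'n/a', '40']
outliers = find_type_outliers(column_values, label_type, ['int', 'float'])
```

With the domain `['int', 'float']`, the date string and the 'n/a' marker are reported as outliers, while the numeric strings are accepted.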