openclean.profiling.classifier.typepicker module

Type pickers select one or more class labels based on statistics about how frequently each label occurs in a set of data values.

The classes in this module implement different strategies for assigning a datatype to a list of values (e.g., a column in a data frame).

class openclean.profiling.classifier.typepicker.MajorityTypePicker(classifier=None, threshold=0, use_total_counts=False, at_most_one=False)

Bases: openclean.profiling.base.DistinctSetProfiler

Pick the most frequent type assigned by a given classifier to the values in a given list. Generates a dictionary containing the most frequent type(s) as key(s) and their normalized frequency as the associated value.

The majority type may be determined from either the distinct values in the given list or the absolute value counts. The choice can be restricted further by requiring the frequency of the selected type to be above a given threshold.

process(values)

Select one or more type labels based on data type statistics that are computed over the given list of values using the associated classifier.

Returns a dictionary where the key(s) are the selected type(s) and the values are normalized type frequencies (using divide_by_total). If no type satisfies the associated threshold, or if more than one type does while the at_most_one flag is True, the result is an empty dictionary.

Parameters

values (dict) – Set of distinct scalar values or tuples of scalar values that are mapped to their respective frequency count.

Return type

list
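The selection rule described above can be illustrated with a minimal, standard-library sketch. It is not the openclean implementation: the `classify` helper and the `pick_majority` function are hypothetical stand-ins, and the sketch assumes a numeric threshold means "frequency strictly greater than the threshold".

```python
from collections import Counter


def classify(value):
    """Hypothetical stand-in classifier: label a value 'int', 'float', or 'str'."""
    for label, cast in (("int", int), ("float", float)):
        try:
            cast(value)
            return label
        except (TypeError, ValueError):
            continue
    return "str"


def pick_majority(values, threshold=0, use_total_counts=False, at_most_one=False):
    """Return {label: normalized frequency} for the most frequent label(s)."""
    label_counts = Counter()
    for value, count in values.items():
        # Each distinct value counts once unless total counts are requested.
        label_counts[classify(value)] += count if use_total_counts else 1
    total = sum(label_counts.values())
    if total == 0:
        return {}
    best = max(label_counts.values())
    # Assumption: the threshold is applied as a strict "greater than" test.
    winners = {
        label: count / total
        for label, count in label_counts.items()
        if count == best and count / total > threshold
    }
    if at_most_one and len(winners) > 1:
        return {}
    return winners


# Distinct values mapped to their frequency counts.
values = Counter({"1": 10, "2": 1, "3.5": 1, "abc": 1, "x": 1})
tie_result = pick_majority(values)
single_result = pick_majority(values, at_most_one=True)
total_result = pick_majority(values, use_total_counts=True)
```

With distinct counts, 'int' and 'str' tie at two values each, so both are returned unless `at_most_one` is set, in which case the result is empty. With total counts, the ten occurrences of "1" make 'int' the sole majority type.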

class openclean.profiling.classifier.typepicker.ThresholdTypePicker(classifier=None, threshold=0, use_total_counts=False)

Bases: openclean.profiling.base.DistinctSetProfiler

Identify all types that a given classifier assigns to the values in a list and whose frequency exceeds a specified threshold. Generates a dictionary containing the types as keys and their normalized frequency as the associated value.

The frequency of a type may be computed based on the distinct values in the given list or the absolute value counts.

process(values)

Select one or more type labels based on data type statistics that are computed over the given list of values using the associated classifier.

Returns a dictionary where the key(s) are the selected type(s) and the values are normalized type frequencies (using divide_by_total). If no type satisfies the associated threshold, the result is an empty dictionary.

Parameters

values (dict) – Set of distinct scalar values or tuples of scalar values that are mapped to their respective frequency count.

Return type

list
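As a rough illustration of threshold-based selection, the following sketch keeps every label whose normalized frequency passes the threshold. The `classify` and `pick_above_threshold` names are hypothetical and not part of openclean, and the strict "greater than" comparison is an assumption based on the wording "exceeds".

```python
from collections import Counter


def classify(value):
    """Hypothetical stand-in classifier: 'digit' for all-digit strings, else 'text'."""
    return "digit" if str(value).isdigit() else "text"


def pick_above_threshold(values, threshold=0, use_total_counts=False):
    """Return {label: normalized frequency} for all labels above the threshold."""
    label_counts = Counter()
    for value, count in values.items():
        # Each distinct value counts once unless total counts are requested.
        label_counts[classify(value)] += count if use_total_counts else 1
    total = sum(label_counts.values())
    # Assumption: a label is kept if its frequency is strictly greater
    # than the threshold.
    return {
        label: count / total
        for label, count in label_counts.items()
        if count / total > threshold
    }


# Distinct values mapped to their frequency counts.
values = Counter({"10": 6, "20": 2, "n/a": 2})
by_distinct = pick_above_threshold(values, threshold=0.5)
by_total = pick_above_threshold(values, threshold=0.1, use_total_counts=True)
```

Unlike the majority picker, more than one label can be returned here: with a threshold of 0.1 and total counts, both 'digit' and 'text' pass.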

openclean.profiling.classifier.typepicker.majority_typepicker(df, columns=None, classifier=None, threshold=0, use_total_counts=False, at_most_one=False)

Pick the most frequent type assigned by a given classifier to the values in a given list. Generates a dictionary containing the most frequent type(s) as key(s) and their normalized frequency as the associated value.

The majority type may be determined from either the distinct values in the given list or the absolute value counts. The choice can be restricted further by requiring the frequency of the selected type to be above a given threshold.

Parameters
  • df (pd.DataFrame) – Input data frame.

  • columns (list, tuple, or openclean.function.eval.base.EvalFunction) – Evaluation function to extract values from data frame rows. This can also be a list or tuple of evaluation functions or a list of column names or index positions.

  • classifier (openclean.function.value.classifier.ValueClassifier, default=None) – Classifier that assigns data type class labels for scalar column values.

  • threshold (callable or int or float, default=0) – Callable predicate or numeric value that is used to constrain the possible candidates based on their normalized frequency.

  • use_total_counts (bool, default=False) – Use total value counts instead of distinct counts to compute type frequencies.

  • at_most_one (bool, default=False) – Ensure that at most one data type is returned in the result. If the flag is True and multiple types have the maximum frequency, an empty dictionary will be returned.

Return type

dict
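The threshold parameter accepts either a callable predicate or a number. One way this could work is sketched below; the `as_predicate` helper is hypothetical, and treating a number as "strictly greater than" is an assumption that the library's own conversion may not share.

```python
def as_predicate(threshold):
    """Turn a numeric threshold into a predicate; pass callables through."""
    if callable(threshold):
        return threshold
    # Assumption: a number means "frequency strictly greater than it".
    return lambda frequency: frequency > threshold


# Example normalized type frequencies (illustrative values only).
frequencies = {"int": 0.7, "str": 0.3}

at_least_30 = as_predicate(lambda f: f >= 0.3)  # caller-defined predicate
above_30 = as_predicate(0.3)                    # numeric threshold

selected_inclusive = {k: v for k, v in frequencies.items() if at_least_30(v)}
selected_strict = {k: v for k, v in frequencies.items() if above_30(v)}
```

Passing a callable makes the boundary behavior explicit: here the inclusive predicate keeps 'str' at exactly 0.3, while the numeric form under the stated assumption drops it.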

openclean.profiling.classifier.typepicker.threshold_typepicker(df, columns=None, classifier=None, threshold=0, use_total_counts=False)

Identify all types that a given classifier assigns to the values in a list and whose frequency exceeds a specified threshold. Generates a dictionary containing the types as keys and their normalized frequency as the associated value.

The frequency of a type may be computed based on the distinct values in the given list or the absolute value counts.

Parameters
  • df (pd.DataFrame) – Input data frame.

  • columns (list, tuple, or openclean.function.eval.base.EvalFunction) – Evaluation function to extract values from data frame rows. This can also be a list or tuple of evaluation functions or a list of column names or index positions.

  • classifier (openclean.function.value.classifier.ValueClassifier, default=None) – Classifier that assigns data type class labels for scalar column values.

  • threshold (callable or int or float, default=0) – Callable predicate or numeric value that is used to constrain the possible candidates based on their normalized frequency.

  • use_total_counts (bool, default=False) – Use total value counts instead of distinct counts to compute type frequencies.