openclean.function.value.base module

Base classes for value functions, together with a collection of basic helper functions.

class openclean.function.value.base.CallableWrapper(func: Callable)

Bases: openclean.function.value.base.PreparedFunction

Wrapper for callable functions as value functions. This value function does not prepare the wrapped callable.

eval(value: Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]) Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]

Evaluate the wrapped function on a given value. The value may either be a scalar or a tuple. The return value of the function is dependent on the wrapped function.

Parameters

value (scalar or tuple) – Value from the list that was used to prepare the function.

Return type

scalar or tuple
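
The wrapping behavior described above can be sketched in a few lines. This is only an illustration under stated assumptions: `CallableWrapperSketch` is a hypothetical stand-in written for this example, not the openclean class, and it does nothing more than forward `eval` to the wrapped callable.

```python
from typing import Callable


class CallableWrapperSketch:
    """Hypothetical stand-in that forwards eval() to a plain callable."""

    def __init__(self, func: Callable):
        self.func = func

    def eval(self, value):
        # Scalars and tuples are handed to the callable unchanged; the
        # return value is whatever the wrapped callable produces.
        return self.func(value)


upper = CallableWrapperSketch(str.upper)
first = CallableWrapperSketch(lambda t: t[0])

result_scalar = upper.eval("abc")      # scalar input
result_tuple = first.eval(("x", 1))    # tuple input
```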

class openclean.function.value.base.ConstantValue(value: Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]])

Bases: openclean.function.value.base.PreparedFunction

Value function that returns a given constant value for all inputs.

eval(value: Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]) Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]

Return the constant result value.

Parameters

value (scalar or tuple) – Value from the list that was used to prepare the function.

Return type

any

class openclean.function.value.base.CounterConverter(func: Callable)

Bases: openclean.function.value.base.PreparedFunction

Wrapper for callable functions that are applied on items of a value counter.

eval(value: Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]) Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]

Evaluate the wrapped function on a given value.

The value is expected to be a tuple (an item from a collections.Counter object) that contains a value and its count. The wrapped callable is applied to the value, and a tuple with the modified value and the original count is returned.

Parameters

value (scalar or tuple) – Value from the list that was used to prepare the function.

Return type

scalar or tuple
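
As a rough illustration of the (value, count) handling described above, the following sketch applies a callable to the value part of each counter item and keeps the count. The helper `convert_item` is a hypothetical name introduced for this example and is not part of openclean.

```python
from collections import Counter


def convert_item(func, item):
    """Apply func to the value of a (value, count) pair, keep the count."""
    value, count = item
    return (func(value), count)


counts = Counter(["a", "b", "a"])
# Counter.items() yields (value, count) tuples in insertion order.
converted = [convert_item(str.upper, item) for item in counts.items()]
```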

class openclean.function.value.base.PreparedFunction

Bases: openclean.function.value.base.ValueFunction

Abstract base class for value functions that do not make use of the prepare method. These functions are considered initialized and ready to operate without the prepare method being called first.

is_prepared() bool

Instances of this class do not need to be further prepared.

Return type

bool

prepare(values: List[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]]) openclean.function.value.base.ValueFunction

The prepare step is ignored for a wrapped callable.

Parameters

values (dict) – Set of distinct scalar values or tuples of scalar values that are mapped to their respective frequency count.

class openclean.function.value.base.UnpreparedFunction

Bases: openclean.function.value.base.ValueFunction

Abstract base class for value functions that make use of the prepare method. These functions are expected to return a new instance of a different value function class as the result of the prepare step.

eval(value: Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]) Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]

Raise an error if the eval method is called since this indicates that the function has not been prepared.

Parameters

value (scalar or tuple) – Value from the list that was used to prepare the function.

Return type

scalar or tuple

is_prepared() bool

Returns False because the function needs to be prepared.

Return type

bool
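
The prepare-then-evaluate pattern described above can be sketched as follows. Both classes are hypothetical stand-ins invented for this example, and the exception type raised by `eval` is an assumption, since the text only says that an error is raised.

```python
class MaxScalerSketch:
    """Hypothetical prepared function: divides each value by a fixed maximum."""

    def __init__(self, maximum):
        self.maximum = maximum

    def is_prepared(self):
        return True

    def eval(self, value):
        return value / self.maximum


class UnpreparedMaxScalerSketch:
    """Hypothetical unprepared function that needs statistics over the values."""

    def is_prepared(self):
        return False

    def eval(self, value):
        # Assumption: RuntimeError is used here only for illustration.
        raise RuntimeError("prepare() must be called before eval()")

    def prepare(self, values):
        # Return a new instance of a different class, as described above.
        return MaxScalerSketch(max(values))


unprepared = UnpreparedMaxScalerSketch()
prepared = unprepared.prepare([2, 4, 8])
scaled = prepared.eval(4)
```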

class openclean.function.value.base.ValueFunction

Bases: object

The abstract class for value functions defines the interface for methods that need to be implemented for preparing and evaluating the function.

apply(values: Union[List[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]], collections.Counter], threads: Optional[int] = None) Union[List[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]], collections.Counter]

Apply the function to each value in a given set.

Depending on the type of the input, the result is either a list of values that are the result of the eval method for the respective input values or a new counter object where keys are the modified values.

Calls the prepare method before executing the eval method on each individual value in the given list.

Parameters
  • values (list or collections.Counter) – List of scalar values or tuples of scalar values, or a counter over such values.

  • threads (int, default=None) – Number of parallel threads to use for processing. If None, the value from the environment variable ‘OPENCLEAN_THREADS’ is used as the default.

Return type

list or collections.Counter
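
The two result shapes described above (list in, list out; counter in, counter out) might look roughly like this. `apply_sketch` is a hypothetical helper, not the library method: it takes a plain callable, skips the prepare step and the threads option, and assumes that counts are summed when several keys map to the same modified value.

```python
from collections import Counter


def apply_sketch(func, values):
    """Apply func to every value; mirror the container type of the input."""
    if isinstance(values, Counter):
        result = Counter()
        for key, count in values.items():
            # Assumption: counts of keys that collapse to one value are added.
            result[func(key)] += count
        return result
    return [func(v) for v in values]


as_list = apply_sketch(str.lower, ["A", "b", "A"])
as_counter = apply_sketch(str.lower, Counter({"A": 2, "a": 1, "B": 4}))
```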

abstract eval(value: Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]) Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]

Evaluate the function on a given value. The value may either be a scalar or a tuple. The value will be from the list of values that was passed to the object in the prepare call.

The return value of the function is implementation dependent.

Parameters

value (scalar or tuple) – Value from the list that was used to prepare the function.

Return type

scalar or tuple

abstract is_prepared() bool

Returns True if the prepare method is ignored by an implementation of this function. Containing classes will only call the prepare method for those value functions that are not prepared.

Return type

bool

map(values: List[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]]) Dict

The map function takes a list of values and outputs a dictionary. The keys in the returned dictionary are the distinct values in the input list. The values that are associated with the keys are the result of applying the eval function of this class on the key value.

Parameters

values (list) – List of scalar values or tuples of scalar values.

Return type

dict
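
A minimal sketch of the mapping described above, assuming a plain callable in place of the eval method. `map_sketch` is a hypothetical name used only for this example.

```python
def map_sketch(func, values):
    """Map each distinct input value to the result of func for that value."""
    # set() removes duplicates, so func runs once per distinct value.
    return {v: func(v) for v in set(values)}


mapping = map_sketch(len, ["aa", "b", "aa"])
```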

abstract prepare(values: List[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]]) openclean.function.value.base.ValueFunction

Optional step to prepare the function for a given set of values. This step allows the function to compute additional statistics over the set of values.

While it is likely that the given set of values represents the values for which the eval() function will be called, this property is not guaranteed.

Parameters

values (list) – List of scalar values or tuples of scalar values.

Return type

openclean.function.value.base.ValueFunction

openclean.function.value.base.extract(values, label, raise_error=True, default_value=None)

Create a flat dictionary from a nested one. The resulting dictionary contains the same keys as the input dictionary. The associated values are the values from the nested dictionaries under the given label.

If a nested value does not contain the given label as a key, a KeyError is raised if the raise_error flag is True. If the flag is False, the given default value is used instead.

Parameters
  • values (dict) – Nested dictionary from which the values with the given label are extracted.

  • label (string) – Label of element for which the metadata array is created.

  • raise_error (bool, default=True) – Raise a KeyError if a nested dictionary value does not contain the given label as a key.

  • default_value (any, default=None) – Default value for values that do not contain the given label as a key.

Return type

openclean.data.metadata.Feature

Raises

KeyError
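
The flattening described above can be sketched with plain dictionaries. `extract_sketch` is a hypothetical re-implementation written from the description, not the library code, and it returns an ordinary dict.

```python
def extract_sketch(values, label, raise_error=True, default_value=None):
    """Flatten a nested dictionary by picking one label from each entry."""
    result = {}
    for key, nested in values.items():
        if label in nested:
            result[key] = nested[label]
        elif raise_error:
            raise KeyError(label)
        else:
            result[key] = default_value
    return result


nested = {"a": {"absolute": 2, "normalized": 0.5}, "b": {"absolute": 1}}
flat = extract_sketch(nested, "normalized", raise_error=False, default_value=0.0)
```

With `raise_error` left at its default of True, the same call would raise a KeyError because the entry for "b" has no "normalized" key.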

openclean.function.value.base.merge(values_1, values_2, labels, join='inner')

Merge two dictionaries. The resulting dictionary will map key values to dictionaries. Each nested dictionary has two elements, representing the values from the respective merged dictionary. The labels for these elements are defined by the labels argument.

The join method allows for four types of merging:

  • inner: Keep only those keys that are in the intersection of both dictionaries.

  • outer: Keep all keys from the union of both dictionaries.

  • left-outer: Keep all keys from the first dictionary.

  • right-outer: Keep all keys from the second dictionary.

Raises a ValueError if the number of given labels is not two or if an invalid join method is specified.

Parameters
  • values_1 (dict) – Left side of the join.

  • values_2 (dict) – Right side of the join.

  • labels (list or tuple) – Labels for the two elements in each nested dictionary. Exactly two labels are required.

  • join (enum['inner', 'outer', 'left-outer', 'right-outer'], default='inner') – Join method identifier.

Return type

dict

Raises

ValueError
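
The four join methods described above can be illustrated with set operations on the dictionary keys. `merge_sketch` is a hypothetical re-implementation based on the description only; in particular, filling a missing side with None is an assumption, since the text does not say what value is used.

```python
def merge_sketch(values_1, values_2, labels, join="inner"):
    """Merge two dictionaries into a dictionary of two-element dictionaries."""
    if len(labels) != 2:
        raise ValueError("expected exactly two labels")
    if join == "inner":
        keys = set(values_1) & set(values_2)
    elif join == "outer":
        keys = set(values_1) | set(values_2)
    elif join == "left-outer":
        keys = set(values_1)
    elif join == "right-outer":
        keys = set(values_2)
    else:
        raise ValueError(f"invalid join method '{join}'")
    left, right = labels
    # Assumption: a key that is missing on one side is filled with None.
    return {k: {left: values_1.get(k), right: values_2.get(k)} for k in keys}


a = {"x": 1, "y": 2}
b = {"y": 3, "z": 4}
inner = merge_sketch(a, b, labels=["l", "r"])
left_outer = merge_sketch(a, b, labels=["l", "r"], join="left-outer")
```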

openclean.function.value.base.normalize(values, normalizer, keep_original=False, labels=None)

Normalize frequency counts in a given dictionary. Expects a dictionary where keys are mapped to numeric values. Applies the given normalization function on all values. Returns a dictionary where keys are mapped to the normalized values.

If the keep_original flag is True, the original values are also included in the result. In this case, the keys in the resulting dictionary are mapped to dictionaries with two values. The default key values for the nested dictionary values are ‘absolute’ for the original value and ‘normalized’ for the normalized value. These names can be overridden by providing a list or tuple of labels with exactly two elements.

Parameters
  • values (dict) – Dictionary that maps arbitrary key values to numeric values.

  • normalizer (callable or openclean.function.value.base.ValueFunction) – Normalization function that will be used to normalize the numeric values in the given dictionary.

  • keep_original (bool, default=False) – If the keep original value is set to True, the resulting dictionary will map key values to dictionaries. Each nested dictionary will have two elements, the original (‘absolute’) value and the normalized value.

  • labels (list or tuple, default=('absolute', 'normalized')) – List or tuple with exactly two elements. The labels will only be used if the keep_original flag is True. The first element is the label for the original value in the returned nested dictionary and the second element is the label for the normalized value.

Return type

dict

Raises

ValueError
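
The effect of the keep_original flag and the labels argument can be sketched as follows. `normalize_sketch` is a hypothetical helper written from the description: it accepts only a plain callable that is applied to each value individually, which is an assumption about how the normalizer is used.

```python
def normalize_sketch(values, normalizer, keep_original=False, labels=None):
    """Normalize numeric dictionary values, optionally keeping the originals."""
    if not keep_original:
        return {k: normalizer(v) for k, v in values.items()}
    labels = labels if labels is not None else ("absolute", "normalized")
    if len(labels) != 2:
        raise ValueError("expected exactly two labels")
    original, normalized = labels
    return {
        k: {original: v, normalized: normalizer(v)} for k, v in values.items()
    }


counts = {"a": 3, "b": 1}
total = sum(counts.values())
plain = normalize_sketch(counts, lambda v: v / total)
detailed = normalize_sketch(counts, lambda v: v / total, keep_original=True)
```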

openclean.function.value.base.to_value_function(arg)

Ensure that a given argument is a ValueFunction. If the arg is callable it will be wrapped. Otherwise, a constant value function is returned.

Parameters

arg (any) – Argument that is tested for being a ValueFunction.

Return type

openclean.function.value.base.ValueFunction
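
The dispatch described above can be sketched with small stand-in classes. All names ending in `Sketch` are hypothetical and exist only in this example; returning an existing value function unchanged is an assumption that follows from the word "ensure" in the description.

```python
class ValueFunctionSketch:
    """Hypothetical stand-in for the abstract value function interface."""

    def eval(self, value):
        raise NotImplementedError()


class WrappedSketch(ValueFunctionSketch):
    """Forwards eval() to a wrapped callable."""

    def __init__(self, func):
        self.func = func

    def eval(self, value):
        return self.func(value)


class ConstantSketch(ValueFunctionSketch):
    """Returns the same constant for every input."""

    def __init__(self, value):
        self.value = value

    def eval(self, value):
        return self.value


def to_value_function_sketch(arg):
    if isinstance(arg, ValueFunctionSketch):
        return arg  # Assumption: existing value functions pass through as-is.
    if callable(arg):
        return WrappedSketch(arg)
    return ConstantSketch(arg)


existing = ConstantSketch(1)
from_callable = to_value_function_sketch(str.strip)
from_constant = to_value_function_sketch(42)
```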