openclean.function.similarity.base module

Base classes for similarity functions and similarity constraints.

class openclean.function.similarity.base.SimilarityConstraint(func: openclean.function.similarity.base.SimilarityFunction, pred: Callable)

Bases: object

Function that validates a constraint, e.g., a threshold predicate, on the similarity between two values (scalar or tuples).

This class is a simple wrapper around a similarity function and a predicate that is evaluated on the similarity score for a given pair of values.

is_satisfied(val_1: Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]], val_2: Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]) → bool

Test if a given pair of values satisfies the similarity constraint.

Returns True if the similarity between val_1 and val_2 satisfies the constraint (e.g., a given threshold).

Parameters
  • val_1 (scalar or tuple) – First value of the pair to compare.

  • val_2 (scalar or tuple) – Second value of the pair to compare.

Return type

bool
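The wrapper behavior described above can be illustrated with a minimal, self-contained sketch. It does not import openclean: RatioSimilarity and ConstraintSketch are hypothetical stand-ins that only mirror the documented func/pred constructor arguments and the is_satisfied method, and the use of difflib as the similarity measure is an assumption made for illustration.

```python
from difflib import SequenceMatcher
from typing import Callable


class RatioSimilarity:
    """Hypothetical stand-in for a similarity function implementation."""

    def sim(self, val_1, val_2) -> float:
        # SequenceMatcher.ratio() returns a float in [0, 1].
        return SequenceMatcher(None, str(val_1), str(val_2)).ratio()


class ConstraintSketch:
    """Hypothetical stand-in: wraps a similarity function and a predicate
    that is evaluated on the similarity score of a pair of values."""

    def __init__(self, func, pred: Callable[[float], bool]):
        self.func = func
        self.pred = pred

    def is_satisfied(self, val_1, val_2) -> bool:
        # Compute the score first, then let the predicate decide.
        return bool(self.pred(self.func.sim(val_1, val_2)))


# Threshold predicate: the similarity score must be at least 0.8.
constraint = ConstraintSketch(
    func=RatioSimilarity(),
    pred=lambda score: score >= 0.8,
)

same = constraint.is_satisfied("Brooklyn", "Brooklyn")
different = constraint.is_satisfied("Brooklyn", "Queens")
```

Because the predicate is an arbitrary callable, the same wrapper shape could express other constraints on the score, such as an upper bound, without changing the similarity function.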

class openclean.function.similarity.base.SimilarityFunction

Bases: object

Mixin class for functions that compute the similarity between two values (scalar or tuples). Primarily useful for string similarity.

Similarity results are float values in the interval [0, 1], where 0 is the minimal similarity between two values and 1 is the maximal similarity.

abstract sim(val_1: Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]], val_2: Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]) → float

Compute the similarity between two values.

The result is in the interval [0, 1], where 0 is the minimal similarity between two values and 1 is the maximal similarity.

Parameters
  • val_1 (scalar or tuple) – First value to compare.

  • val_2 (scalar or tuple) – Second value to compare.

Return type

float
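As an illustration of the contract described above, the following sketch implements an abstract sim method that returns a score in [0, 1]. It is self-contained and does not import openclean: SimilaritySketch is a hypothetical stand-in for the abstract interface, and the character-set Jaccard measure is just one possible choice, picked here because its result is bounded by construction.

```python
from abc import ABC, abstractmethod


class SimilaritySketch(ABC):
    """Hypothetical stand-in for the documented abstract interface."""

    @abstractmethod
    def sim(self, val_1, val_2) -> float:
        """Return a similarity score in the interval [0, 1]."""
        raise NotImplementedError()


class JaccardCharSimilarity(SimilaritySketch):
    """Jaccard similarity over the sets of characters in two values."""

    def sim(self, val_1, val_2) -> float:
        chars_1, chars_2 = set(str(val_1)), set(str(val_2))
        union = chars_1 | chars_2
        if not union:
            # Assumption: two empty values count as maximally similar.
            return 1.0
        # |intersection| / |union| is always between 0 and 1.
        return len(chars_1 & chars_2) / len(union)


f = JaccardCharSimilarity()
identical = f.sim("abc", "abc")   # all characters shared
disjoint = f.sim("abc", "xyz")    # no characters shared
partial = f.sim("abcd", "abef")   # 2 shared out of 6 distinct characters
```

Declaring sim as abstract means the base class cannot be instantiated directly; only subclasses that provide a concrete sim can be used.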