openclean.function.similarity.text module

Collection of string similarity functions.

class openclean.function.similarity.text.DamerauLevenshteinDistance

Bases: openclean.function.similarity.text.NormalizedEditDistance

String similarity function that is based on the Damerau-Levenshtein distance between two strings.

class openclean.function.similarity.text.HammingDistance

Bases: openclean.function.similarity.text.NormalizedEditDistance

String similarity function that is based on the Hamming distance between two strings.
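
As a rough illustration of how a Hamming distance can be turned into a similarity score, the sketch below counts mismatching positions and normalizes by the length of the longer string. The helper names `hamming` and `hamming_similarity` are hypothetical, and counting surplus characters of the longer string as mismatches is an assumption; the library may handle strings of unequal length differently.

```python
def hamming(a: str, b: str) -> int:
    """Count positions at which the two strings differ.

    Assumption: surplus characters of the longer string each count
    as one mismatch.
    """
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches + abs(len(a) - len(b))


def hamming_similarity(a: str, b: str) -> float:
    """Return 1 - (Hamming distance / length of the longer string)."""
    longest = max(len(a), len(b))
    if longest == 0:
        # Assumption: two empty strings are treated as identical.
        return 1.0
    return 1.0 - hamming(a, b) / longest


print(hamming("karolin", "kathrin"))
print(hamming_similarity("karolin", "kathrin"))
```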

class openclean.function.similarity.text.JaroSimilarity

Bases: openclean.function.similarity.text.StringSimilarityFunction

String similarity function that is based on the Jaro similarity between two strings.

class openclean.function.similarity.text.JaroWinklerSimilarity

Bases: openclean.function.similarity.text.StringSimilarityFunction

String similarity function that is based on the Jaro-Winkler similarity between two strings.

class openclean.function.similarity.text.LevenshteinDistance

Bases: openclean.function.similarity.text.NormalizedEditDistance

String similarity function that is based on the Levenshtein distance between two strings.

class openclean.function.similarity.text.MatchRatingComparison

Bases: openclean.function.similarity.base.SimilarityFunction

String similarity function that is based on the match rating algorithm, which returns True if two strings are considered equivalent and False otherwise.

To return a value in the interval [0, 1], a match rating result of True is translated to 1 and a result of False is translated to 0.

sim(val_1: str, val_2: str) → float

Use the match rating approach to compare the given strings.

Returns 1 if the match rating algorithm considers the given strings equivalent and 0 otherwise.

Parameters
  • val_1 (string) – Value 1

  • val_2 (string) – Value 2

Return type

float
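
The translation from a Boolean comparison result to a value in [0, 1] can be sketched as follows. This is a minimal illustration, not the library's implementation: `boolean_to_similarity` is a hypothetical helper, and `same_ignoring_case` is only a stand-in for a real match rating comparator.

```python
from typing import Callable


def boolean_to_similarity(
    compare: Callable[[str, str], bool], val_1: str, val_2: str
) -> float:
    """Map a True/False comparison result to 1.0 or 0.0."""
    return 1.0 if compare(val_1, val_2) else 0.0


def same_ignoring_case(a: str, b: str) -> bool:
    """Hypothetical stand-in for a match rating comparison."""
    return a.lower() == b.lower()


print(boolean_to_similarity(same_ignoring_case, "Smith", "SMITH"))
print(boolean_to_similarity(same_ignoring_case, "Smith", "Jones"))
```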

class openclean.function.similarity.text.NormalizedEditDistance(func: Callable)

Bases: openclean.function.similarity.base.SimilarityFunction

String similarity function that is based on functions that compute an edit distance between a pair of strings.

The similarity for a pair of strings based on edit distance is defined as (1 - normalized distance).

sim(val_1: str, val_2: str) → float

Calculates the edit distance between two strings and returns the similarity between them as (1 - normalized distance). The normalized distance is the edit distance divided by the length of the longer of the two strings.

Parameters
  • val_1 (string) – Value 1

  • val_2 (string) – Value 2

Return type

float
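
The normalization described above can be sketched in plain Python. The example uses a simple dynamic-programming Levenshtein distance as the edit distance function. Both `levenshtein` and `normalized_similarity` are hypothetical helpers written for illustration, and returning 1.0 for two empty strings is an assumption rather than documented behavior.

```python
from typing import Callable


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming Levenshtein distance (illustrative helper)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def normalized_similarity(
    func: Callable[[str, str], int], val_1: str, val_2: str
) -> float:
    """Return 1 - (edit distance / length of the longer string)."""
    longest = max(len(val_1), len(val_2))
    if longest == 0:
        # Assumption: two empty strings are treated as identical.
        return 1.0
    return 1.0 - func(val_1, val_2) / longest


# Distance 3 over a longest length of 7 gives 1 - 3/7.
print(normalized_similarity(levenshtein, "kitten", "sitting"))
```

Passing the distance function as an argument mirrors the `func: Callable` constructor parameter, so the same normalization can be reused with other edit distances.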

class openclean.function.similarity.text.StringSimilarityFunction(func: Callable)

Bases: openclean.function.similarity.base.SimilarityFunction

Wrapper for existing string similarity functions that compute the similarity between a pair of strings as a float in the interval [0, 1].

sim(val_1: str, val_2: str) → float

Calculate the similarity between the given pair of strings.

Parameters
  • val_1 (string) – Value 1

  • val_2 (string) – Value 2

Return type

float
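
The wrapper idea can be sketched with a function from the Python standard library. Here `WrappedSimilarity` is a hypothetical stand-in for the class documented above, and `difflib.SequenceMatcher.ratio` is used only as an example of an existing function that already returns a value in [0, 1]; it is not one of the functions this module wraps.

```python
from difflib import SequenceMatcher
from typing import Callable


class WrappedSimilarity:
    """Hypothetical wrapper around a function that returns a value in [0, 1]."""

    def __init__(self, func: Callable[[str, str], float]):
        self.func = func

    def sim(self, val_1: str, val_2: str) -> float:
        # Delegate to the wrapped function and coerce the result to float.
        return float(self.func(val_1, val_2))


def ratio(a: str, b: str) -> float:
    """Similarity ratio computed by difflib's SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()


f = WrappedSimilarity(ratio)
print(f.sim("night", "nacht"))
```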