openclean.function.similarity.text module
Collection of string similarity functions.
- class openclean.function.similarity.text.DamerauLevenshteinDistance
Bases:
openclean.function.similarity.text.NormalizedEditDistance
String similarity function that is based on the Damerau-Levenshtein distance between two strings.
- class openclean.function.similarity.text.HammingDistance
Bases:
openclean.function.similarity.text.NormalizedEditDistance
String similarity function that is based on the Hamming distance between two strings.
- class openclean.function.similarity.text.JaroSimilarity
Bases:
openclean.function.similarity.text.StringSimilarityFunction
String similarity function that is based on the Jaro similarity between two strings.
- class openclean.function.similarity.text.JaroWinklerSimilarity
Bases:
openclean.function.similarity.text.StringSimilarityFunction
String similarity function that is based on the Jaro-Winkler distance between two strings.
- class openclean.function.similarity.text.LevenshteinDistance
Bases:
openclean.function.similarity.text.NormalizedEditDistance
String similarity function that is based on the Levenshtein distance between two strings.
- class openclean.function.similarity.text.MatchRatingComparison
Bases:
openclean.function.similarity.base.SimilarityFunction
String similarity function based on the match rating approach, which returns True if two strings are considered equivalent and False otherwise.
To return a value in the interval [0, 1], a match rating result of True is translated to 1 and a result of False is translated to 0.
- sim(val_1: str, val_2: str) float
Use the match rating approach to compare the given strings.
Returns 1 if the match rating algorithm considers the given strings equivalent and 0 otherwise.
- Parameters
val_1 (string) – Value 1
val_2 (string) – Value 2
- Return type
float
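The translation from a Boolean comparison to a similarity score can be sketched as follows. This is a minimal illustration, not the library's implementation: `as_similarity` and `same_ignoring_case` are hypothetical names, and the predicate is a simple stand-in rather than the match rating algorithm itself.

```python
from typing import Callable


def as_similarity(is_match: Callable[[str, str], bool]) -> Callable[[str, str], float]:
    """Turn a True/False string comparison into a 0.0/1.0 similarity function."""
    def sim(val_1: str, val_2: str) -> float:
        # True is translated to 1.0 and False to 0.0.
        return 1.0 if is_match(val_1, val_2) else 0.0
    return sim


def same_ignoring_case(val_1: str, val_2: str) -> bool:
    """Stand-in predicate (an assumption for this sketch, NOT match rating)."""
    return val_1.casefold() == val_2.casefold()


sim = as_similarity(same_ignoring_case)
print(sim("Smith", "SMITH"))
print(sim("Smith", "Smyth"))
```

Any comparison that returns a Boolean can be plugged in the same way, which keeps the result compatible with functions that expect a score in [0, 1].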
- class openclean.function.similarity.text.NormalizedEditDistance(func: Callable)
Bases:
openclean.function.similarity.base.SimilarityFunction
String similarity function that is based on functions that compute an edit distance between a pair of strings.
The similarity of a pair of strings based on edit distance is defined as (1 - normalized distance).
- sim(val_1: str, val_2: str) float
Calculates the edit distance between two strings and returns the similarity between them as (1 - normalized distance). The normalized distance is the edit distance divided by the length of the longer of the two strings.
- Parameters
val_1 (string) – Value 1
val_2 (string) – Value 2
- Return type
float
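The computation described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: `levenshtein` and `normalized_edit_similarity` are hypothetical helper names, and returning 1.0 for two empty strings is an assumption made here to avoid dividing by zero.

```python
from typing import Callable


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming Levenshtein distance (illustrative helper)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def normalized_edit_similarity(
    val_1: str,
    val_2: str,
    distance: Callable[[str, str], int] = levenshtein,
) -> float:
    """Return 1 - (edit distance / length of the longer string)."""
    longest = max(len(val_1), len(val_2))
    if longest == 0:
        # Assumption: two empty strings are treated as identical.
        return 1.0
    return 1.0 - distance(val_1, val_2) / longest


# "kitten" -> "sitting" needs 3 edits; the longer string has 7 characters.
print(normalized_edit_similarity("kitten", "sitting"))
```

Dividing by the length of the longer string keeps the result in [0, 1] for distances that never exceed that length, which holds for the Levenshtein distance used in this sketch.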
- class openclean.function.similarity.text.StringSimilarityFunction(func: Callable)
Bases:
openclean.function.similarity.base.SimilarityFunction
Wrapper for existing string similarity functions that compute the similarity between a pair of strings as a float in the interval [0, 1].
- sim(val_1: str, val_2: str) float
Calculate the similarity between the given pair of strings.
- Parameters
val_1 (string) – Value 1
val_2 (string) – Value 2
- Return type
float
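The wrapper pattern can be sketched with the standard library alone. `WrappedStringSimilarity` is a hypothetical, simplified stand-in for the class documented above, and `sequence_ratio` (built on `difflib.SequenceMatcher`) is used here only as an example of an existing function that already returns a value in [0, 1].

```python
from difflib import SequenceMatcher
from typing import Callable


class WrappedStringSimilarity:
    """Simplified stand-in: delegates sim() to a wrapped callable."""

    def __init__(self, func: Callable[[str, str], float]):
        # func is expected to return a similarity in [0, 1].
        self.func = func

    def sim(self, val_1: str, val_2: str) -> float:
        return float(self.func(val_1, val_2))


def sequence_ratio(val_1: str, val_2: str) -> float:
    """Similarity ratio from difflib (example function for this sketch)."""
    return SequenceMatcher(None, val_1, val_2).ratio()


matcher = WrappedStringSimilarity(sequence_ratio)
print(matcher.sim("kitten", "sitting"))
```

Because the wrapper only forwards the call, any function with the same two-string signature and a [0, 1] result could be substituted without changing the calling code.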