openclean.function.matching.fuzzy module

Fuzzy Approximate String Matching

class openclean.function.matching.fuzzy.FuzzySimilarity(vocabulary: Optional[Iterable[str]] = None, gram_size_lower: Optional[int] = 2, gram_size_upper: Optional[int] = 3, use_levenshtein: Optional[bool] = True, rel_sim_cutoff: Optional[float] = 1.0)

Bases: openclean.function.matching.base.StringSimilarity

FuzzySet implementation for the String Similarity class. This is a simple implementation that uses fuzzy string comparisons to do approximate string matching operations against the provided vocabulary.

Note: it converts everything to lowercase and removes all punctuation except commas and spaces.
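The normalization described in the note can be sketched as follows. This is a minimal illustration, not the library's code: the helper name `normalize` and the regular expression are assumptions, and the sketch treats underscores as word characters, so they are kept.

```python
import re

# Hypothetical pattern: anything that is not a word character, a comma or a
# space is treated as punctuation and removed.
_NON_WORD = re.compile(r"[^\w, ]+")


def normalize(value: str) -> str:
    """Lowercase the value and strip punctuation except commas and spaces."""
    return _NON_WORD.sub("", value.lower())


print(normalize("Tokyo, JAPAN!"))
print(normalize("St. John's"))
```

Under these assumptions, "Tokyo, JAPAN!" and "tokyo, japan" normalize to the same string, so they would compare as equal.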

add(value: str)

Create n-grams from a vocabulary word, calculate the L2 norm, and store the values in the internal dictionaries.

The steps are as follows:

- The n-grams are computed, e.g. tokyo -> -tokyo- (start and end characters added) -> -t, to, ok, ky, yo, o- (2-grams).
- The 2-gram and 3-gram frequencies are counted and the norms are calculated from them.
- The norms, along with the words, are stored in self.items for all words in the vocabulary.
- The n-grams, along with their frequencies, are stored in self.match_dict.
- exact_set maps the lowercased entry to the original value in a dict, after which the method returns.

Parameters

value (str) – The vocabulary word to include
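The indexing steps above can be sketched in a few lines. The function names `padded_grams` and `index_word` and the exact shapes of `items` and `match_dict` are assumptions made for illustration; the class's internal layout may differ.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def padded_grams(value: str, gram_size: int = 2) -> Counter:
    """Count the n-grams of the value after adding '-' at both ends."""
    padded = "-" + value.lower() + "-"
    return Counter(
        padded[i:i + gram_size] for i in range(len(padded) - gram_size + 1)
    )


def index_word(
    value: str,
    gram_size: int,
    items: List[Tuple[float, str]],
    match_dict: Dict[str, List[Tuple[int, int]]],
) -> float:
    """Store the word's L2 norm and its n-gram frequencies (assumed layout)."""
    counts = padded_grams(value, gram_size)
    # L2 norm of the n-gram frequency vector.
    norm = math.sqrt(sum(c * c for c in counts.values()))
    items.append((norm, value.lower()))
    index = len(items) - 1
    for gram, count in counts.items():
        # Each n-gram points at the words containing it and how often.
        match_dict.setdefault(gram, []).append((index, count))
    return norm


items: List[Tuple[float, str]] = []
match_dict: Dict[str, List[Tuple[int, int]]] = {}
norm = index_word("tokyo", 2, items, match_dict)
```

For "tokyo" with 2-grams, every n-gram occurs once, so the norm is the square root of the number of distinct n-grams (six).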

compute(value: str, gram_size: int) → List[Tuple[float, str]]

Computes the n-grams of the query string and compares them against the vocabulary words, returning the matches whose similarity is greater than the threshold.

Parameters

value (str) – the query string
gram_size (int) – the n in n-gram

Return type

List[Tuple[float, str]]
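One way to picture this computation is as a cosine similarity between n-gram frequency vectors. The sketch below is written under that assumption and is not the library's implementation: `gram_counts`, `cosine_matches` and the `threshold` parameter are hypothetical names, and the optional Levenshtein step suggested by `use_levenshtein` is left out.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple


def gram_counts(value: str, gram_size: int) -> Counter:
    """Count the n-grams of the value after adding '-' at both ends."""
    padded = f"-{value.lower()}-"
    return Counter(
        padded[i:i + gram_size] for i in range(len(padded) - gram_size + 1)
    )


def cosine_matches(
    query: str,
    vocabulary: Iterable[str],
    gram_size: int = 2,
    threshold: float = 0.5,
) -> List[Tuple[float, str]]:
    """Return (score, word) pairs scoring above the threshold, best first."""
    q = gram_counts(query, gram_size)
    q_norm = math.sqrt(sum(c * c for c in q.values()))
    results = []
    for word in vocabulary:
        w = gram_counts(word, gram_size)
        w_norm = math.sqrt(sum(c * c for c in w.values()))
        if q_norm == 0 or w_norm == 0:
            continue
        # A Counter returns 0 for n-grams that the word does not contain.
        dot = sum(count * w[gram] for gram, count in q.items())
        score = dot / (q_norm * w_norm)
        if score > threshold:
            results.append((score, word))
    return sorted(results, reverse=True)


matches = cosine_matches("tokyo", ["tokyo", "kyoto", "paris"])
```

In this example "kyoto" shares four of its six 2-grams with "tokyo" (ky, yo, to, o-), giving a score of 4/6, while "paris" shares none and is filtered out.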

match(vocabulary: Iterable[str], query: str) → List[openclean.data.mapping.StringMatch]

Compute a fuzzy similarity score for a string against items from a vocabulary iterable.

Parameters

vocabulary (Iterable[str]) – The strings to compare the query with.
query (str) – The query term, i.e. the second argument for the similarity score computation.

Return type

list of openclean.data.mapping.StringMatch

search(key: str, default: Union[None, Tuple[float, str]] = None) → List[Tuple[float, str]]

Searches the vocabulary for matches of the key and returns the default value if no match is found.

Parameters

key (str) – the query string
default (Union[None, Tuple[float, str]], default = None) – The default value to return if no match is found

openclean.function.matching.fuzzy.gram_counter(value: str, gram_size: int = 2) → dict

Counts the n-grams in the given value and their frequencies

Parameters

value (str) – The string to compute the n-grams from
gram_size (int, default= 2) – The n in the n-gram

Return type

dict

openclean.function.matching.fuzzy.gram_iterator(value: str, gram_size: int = 2)

Yields all the n-grams of the given value

Parameters

value (str) – The string to compute the n-grams from
gram_size (int, default= 2) – The n in the n-gram
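The relationship between the two helper functions can be illustrated with a small stand-alone sketch. The names `iter_grams` and `count_grams` are hypothetical stand-ins, and the '-' padding follows the tokyo example given for add(); the actual functions may handle edge cases such as very short strings differently.

```python
from collections import Counter
from typing import Dict, Iterator


def iter_grams(value: str, gram_size: int = 2) -> Iterator[str]:
    """Yield each n-gram of the value, padded with '-' at both ends."""
    padded = f"-{value}-"
    for i in range(len(padded) - gram_size + 1):
        yield padded[i:i + gram_size]


def count_grams(value: str, gram_size: int = 2) -> Dict[str, int]:
    """Map each n-gram of the value to the number of times it occurs."""
    return dict(Counter(iter_grams(value, gram_size)))


bigrams = list(iter_grams("tokyo"))
trigrams = list(iter_grams("tokyo", 3))
repeated = count_grams("aaa")
```

Building the counter on top of the iterator keeps the padding rule in one place, so both functions always agree on what counts as an n-gram.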