openclean.function.matching.fuzzy module
Fuzzy Approximate String Matching
- class openclean.function.matching.fuzzy.FuzzySimilarity(vocabulary: Optional[Iterable[str]] = None, gram_size_lower: Optional[int] = 2, gram_size_upper: Optional[int] = 3, use_levenshtein: Optional[bool] = True, rel_sim_cutoff: Optional[float] = 1.0)
Bases: openclean.function.matching.base.StringSimilarity
FuzzySet implementation for the String Similarity class. This is a simple implementation that uses fuzzy string comparisons to do approximate string matching operations against the provided vocabulary.
Note: it converts everything to lowercase and removes all punctuation except commas and spaces.
- add(value: str)
Create n-grams from a vocabulary word, calculate the L2 norm, and store the values in the internal dictionaries.
The steps are:
  - The n-grams are computed after adding start and end characters, e.g. tokyo -> -tokyo- -> -t, to, ok, ky, yo, o- (2-grams).
  - The 2-gram and 3-gram frequencies are counted and the norms are calculated from them.
  - The norms, along with the words, are stored in self.items for all words in the vocabulary.
  - The n-grams, along with their frequencies, are stored in self.match_dict.
  - exact_set stores each entry in a dict as lowercased entry: original.
- Parameters
value (str) – The vocabulary word to include
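The steps listed for add() can be sketched in a few lines of standalone Python. This is an illustrative approximation, not the openclean source: the class name TinyIndex, the helper ngrams, the normalization regex, and the exact layout of the tuples stored in items and match_dict are all assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict

# Assumption: keep word characters, commas, and spaces; drop everything else.
_NON_WORD = re.compile(r"[^\w, ]+")


def ngrams(value: str, gram_size: int) -> Counter:
    """Count the n-grams of a lowercased, '-'-padded string."""
    padded = "-" + _NON_WORD.sub("", value.lower()) + "-"
    return Counter(
        padded[i:i + gram_size] for i in range(len(padded) - gram_size + 1)
    )


class TinyIndex:
    """Hypothetical stand-in for the internal dictionaries described above."""

    def __init__(self, gram_sizes=(2, 3)):
        self.gram_sizes = gram_sizes
        self.exact_set = {}                       # lowercased entry -> original
        self.items = {n: [] for n in gram_sizes}  # n -> [(norm, lowercased entry)]
        self.match_dict = defaultdict(list)       # gram -> [(n, item index, count)]

    def add(self, value: str) -> None:
        key = value.lower()
        if key in self.exact_set:
            return  # already indexed
        for n in self.gram_sizes:
            counts = ngrams(key, n)
            # L2 norm of the gram-frequency vector.
            norm = math.sqrt(sum(c * c for c in counts.values()))
            self.items[n].append((norm, key))
            idx = len(self.items[n]) - 1
            for gram, count in counts.items():
                self.match_dict[gram].append((n, idx, count))
        self.exact_set[key] = value


index = TinyIndex()
index.add("Tokyo")
print(index.items[2])  # one (norm, word) pair for the 2-gram vector
```

Storing the norm at insertion time means a later query only has to compute its own norm and a dot product over shared grams.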
- compute(value: str, gram_size: int) -> List[Tuple[float, str]]
Computes the ngrams from the query string and calculates distances with the vocabulary words to return matches with similarity greater than the threshold.
- Parameters
value (str) – the query string
gram_size (int) – the n in n-gram
- Return type
List[Tuple[float, str]]
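As a rough illustration of the scoring idea, the sketch below ranks vocabulary words by cosine similarity of their n-gram frequency vectors. It assumes the score is a cosine similarity (which the stored L2 norms suggest but the text above does not state), it leaves out any Levenshtein-based rescoring, and the names gram_counts, cosine_matches, and threshold are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple


def gram_counts(value: str, gram_size: int) -> Counter:
    """Count n-grams of the lowercased value padded with '-' on both ends."""
    padded = f"-{value.lower()}-"
    return Counter(
        padded[i:i + gram_size] for i in range(len(padded) - gram_size + 1)
    )


def l2_norm(counts: Counter) -> float:
    return math.sqrt(sum(c * c for c in counts.values()))


def cosine_matches(
    query: str,
    vocabulary: Iterable[str],
    gram_size: int = 2,
    threshold: float = 0.0,
) -> List[Tuple[float, str]]:
    """Return (score, word) pairs with score > threshold, best first."""
    q = gram_counts(query, gram_size)
    q_norm = l2_norm(q)
    results = []
    for word in vocabulary:
        w = gram_counts(word, gram_size)
        w_norm = l2_norm(w)
        if q_norm == 0 or w_norm == 0:
            continue  # strings too short to produce any n-gram
        # Dot product over the grams of the query; missing grams count as 0.
        dot = sum(count * w[gram] for gram, count in q.items())
        score = dot / (q_norm * w_norm)
        if score > threshold:
            results.append((score, word))
    return sorted(results, reverse=True)


matches = cosine_matches("tokyo", ["tokyo", "kyoto", "osaka"])
for score, word in matches:
    print(f"{word}: {score:.3f}")
```

In this example "kyoto" shares four of its six 2-grams with "tokyo", while "osaka" shares none and is filtered out by the threshold.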
- match(vocabulary: Iterable[str], query: str) -> List[openclean.data.mapping.StringMatch]
Compute a fuzzy similarity score for a string against items from a vocabulary iterable.
- Parameters
vocabulary (Iterable[str]) – List of strings to compare with.
query (str) – The query term to score against each vocabulary item.
- Return type
list of openclean.data.mapping.StringMatch
- search(key: str, default: Union[None, Tuple[float, str]] = None) -> List[Tuple[float, str]]
Searches the vocabulary for matches to the key and returns the default value if no match is found.
- Parameters
key (str) – the query string
default (Union[None, Tuple[float, str]], default = None) – the value to return if no match is found
- openclean.function.matching.fuzzy.gram_counter(value: str, gram_size: int = 2) -> dict
Counts the n-grams in the given value and returns their frequencies.
- Parameters
value (str) – The string to compute the n-grams from
gram_size (int, default = 2) – The n in the n-gram
- Return type
dict
- openclean.function.matching.fuzzy.gram_iterator(value: str, gram_size: int = 2)
Iterates and yields all the ngrams from the given value
- Parameters
value (str) – The string to compute the n-grams from
gram_size (int, default = 2) – The n in the n-gram
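The relationship between the two module-level helpers can be pictured as a generator feeding a frequency dictionary. The sketch below uses the hypothetical names iter_grams and count_grams rather than the library functions, and it assumes the '-' padding described for add() is applied inside the iterator, which the text above does not confirm.

```python
from typing import Dict, Iterator


def iter_grams(value: str, gram_size: int = 2) -> Iterator[str]:
    """Yield every n-gram of the lowercased value padded with '-'."""
    padded = f"-{value.lower()}-"
    for start in range(len(padded) - gram_size + 1):
        yield padded[start:start + gram_size]


def count_grams(value: str, gram_size: int = 2) -> Dict[str, int]:
    """Map each n-gram to the number of times it occurs in the value."""
    counts: Dict[str, int] = {}
    for gram in iter_grams(value, gram_size):
        counts[gram] = counts.get(gram, 0) + 1
    return counts


print(list(iter_grams("ab")))
print(count_grams("banana"))
```

Repeated grams, such as "an" and "na" in "banana", are what make the frequency counts (and therefore the norms) differ from a plain set of grams.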