openclean.function.matching.base module

Base classes and types for string matching functions.

class openclean.function.matching.base.DefaultStringMatcher(vocabulary: Iterable[str], similarity: openclean.function.matching.base.StringSimilarity, best_matches_only: Optional[bool] = True, no_match_threshold: Optional[float] = 0.0, cache_results: Optional[bool] = True)

Bases: openclean.function.matching.base.StringMatcher

Default implementation for the string matcher. This is a simple implementation that naively computes the similarity between a query string and every string in the associated vocabulary by letting the string similarity object deal with the vocabulary directly.

The default matcher allows the user to control the list of returned matches via two configuration parameters:

  • best_matches_only: If this flag is True, only the matches with the highest score are returned in the result. If the flag is False, all matches with a score greater than 0 (or the no_match_threshold, see below) are returned.

  • no_match_threshold: Score threshold that controls when a similarity score is considered a non-match.

By default, the matcher caches the matches it finds for each query value so that matches for the same query value are not computed twice. Caching can be disabled using the cache_results flag.
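The interplay of the three parameters described above can be sketched in a few lines. This is a minimal, self-contained illustration and not the library's implementation: the class DefaultMatcherSketch, the scoring helper prefix_score, and the use of plain (term, score) tuples in place of StringMatch objects are all assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Match = Tuple[str, float]  # stand-in for a StringMatch (term, score) pair


def prefix_score(term: str, query: str) -> float:
    """Hypothetical similarity: shared prefix length over the longer length."""
    shared = 0
    for a, b in zip(term.lower(), query.lower()):
        if a != b:
            break
        shared += 1
    return shared / max(len(term), len(query), 1)


class DefaultMatcherSketch:
    """Illustrative stand-in that mirrors the documented parameters."""

    def __init__(
        self,
        vocabulary: Iterable[str],
        score: Callable[[str, str], float],
        best_matches_only: bool = True,
        no_match_threshold: float = 0.0,
        cache_results: bool = True,
    ):
        self.vocabulary = list(vocabulary)
        self.score = score
        self.best_matches_only = best_matches_only
        self.no_match_threshold = no_match_threshold
        # A dict keyed by query string; None means caching is disabled.
        self._cache: Optional[Dict[str, List[Match]]] = {} if cache_results else None

    def find_matches(self, query: str) -> List[Match]:
        if self._cache is not None and query in self._cache:
            return self._cache[query]
        # Naively score the query against every vocabulary term.
        scored = [(term, self.score(term, query)) for term in self.vocabulary]
        # Scores at or below the threshold count as non-matches.
        kept = [pair for pair in scored if pair[1] > self.no_match_threshold]
        kept.sort(key=lambda pair: pair[1], reverse=True)
        if self.best_matches_only and kept:
            top = kept[0][1]
            kept = [pair for pair in kept if pair[1] == top]
        if self._cache is not None:
            self._cache[query] = kept
        return kept


vocabulary = ["Brooklyn", "Bronx", "Queens"]
best = DefaultMatcherSketch(vocabulary, prefix_score).find_matches("Bro")
everything = DefaultMatcherSketch(
    vocabulary, prefix_score, best_matches_only=False
).find_matches("Bro")
print(best)        # [('Bronx', 0.6)]
print(everything)  # [('Bronx', 0.6), ('Brooklyn', 0.375)]
```

With best_matches_only left at True only the top-scoring term survives; setting it to False keeps every term whose score exceeds the threshold.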

find_matches(query: str) → List[openclean.data.mapping.StringMatch]

Find matches for a given query string in the associated vocabulary. Depending on the implementation, the result may contain more than one matched string from the vocabulary. Each match pairs a matched value with its match score.

If no matches are found for a given query string the result is an empty list.

Parameters

query (string) – Query string for which matches are returned.

Return type

list of openclean.data.mapping.StringMatch

class openclean.function.matching.base.ExactSimilarity(transformer: typing.Optional[typing.Callable] = <function scalar_pass_through>, ignore_case: typing.Optional[bool] = False)

Bases: openclean.function.matching.base.StringSimilarity

Implementation of the string similarity class that performs exact matches for string arguments. Values can be transformed before they are compared, using a simple callable that expects a single argument.

The returned score is 1 for identical strings and 0 for non-identical strings. The ignore_case flag makes the comparison case-insensitive.

match(vocabulary: Iterable[str], query: str) → List[openclean.data.mapping.StringMatch]

Compare the query with each vocabulary string for equality. Returns an exact match if the two strings are the same and a NoMatch otherwise.

Parameters
  • vocabulary (Iterable[str]) – List of strings to compare with.

  • query (string) – Query term that is compared against the vocabulary strings.

Return type

list of openclean.data.mapping.StringMatch
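The exact-match behavior described above can be approximated as follows. This is a hedged sketch, not the library's code: the function name exact_scores is hypothetical, the result uses plain (term, score) tuples rather than StringMatch objects, and both the order of operations (transform first, then lower-case) and the choice to return one pair per vocabulary term are assumptions.

```python
from typing import Callable, Iterable, List, Tuple


def exact_scores(
    vocabulary: Iterable[str],
    query: str,
    transformer: Callable[[str], str] = lambda value: value,
    ignore_case: bool = False,
) -> List[Tuple[str, float]]:
    """Score each vocabulary term 1.0 if it equals the query, else 0.0."""

    def normalize(value: str) -> str:
        # Assumed order: apply the transformer, then fold case if requested.
        value = transformer(value)
        return value.lower() if ignore_case else value

    target = normalize(query)
    return [
        (term, 1.0 if normalize(term) == target else 0.0)
        for term in vocabulary
    ]


terms = ["NY", "ny ", "NJ"]
print(exact_scores(terms, "ny"))
# [('NY', 0.0), ('ny ', 0.0), ('NJ', 0.0)]
print(exact_scores(terms, "ny", ignore_case=True))
# [('NY', 1.0), ('ny ', 0.0), ('NJ', 0.0)]
print(exact_scores(terms, "ny", transformer=str.strip, ignore_case=True))
# [('NY', 1.0), ('ny ', 1.0), ('NJ', 0.0)]
```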

class openclean.function.matching.base.StringMatcher(terms: Iterable[str])

Bases: object

Abstract base class for functions that find matches for a query string in a given vocabulary (iterable of strings). Instances of this class are associated with a vocabulary. They return one or more matches from that vocabulary for a given query string.

abstract find_matches(query: str) → List[openclean.data.mapping.StringMatch]

Find matches for a given query string in the associated vocabulary. Depending on the implementation, the result may contain more than one matched string from the vocabulary. Each match pairs a matched value with its match score.

Matches are sorted by decreasing similarity score. If no matches are found for a given query string the result is an empty list.

Parameters

query (string) – Query string for which matches are returned.

Return type

list of openclean.data.mapping.StringMatch

matched_values(query: str) → List[str]

Get only a list of matched values for a given query string. Excludes information about the match scores.

Parameters

query (string) – Query string for which matches are returned.

Return type

list of string
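The relationship between the abstract find_matches method and the derived matched_values method can be illustrated with a small stand-in hierarchy. Everything here is an assumption made for the example: MatcherSketch and PrefixMatcher are hypothetical classes, matches are plain (term, score) tuples, and the prefix-based score is chosen only because it is easy to follow.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List, Tuple


class MatcherSketch(ABC):
    """Stand-in for an abstract matcher bound to a vocabulary."""

    def __init__(self, terms: Iterable[str]):
        self.terms = list(terms)

    @abstractmethod
    def find_matches(self, query: str) -> List[Tuple[str, float]]:
        """Return (term, score) pairs sorted by decreasing score."""

    def matched_values(self, query: str) -> List[str]:
        # Reuse find_matches and drop the scores.
        return [term for term, _ in self.find_matches(query)]


class PrefixMatcher(MatcherSketch):
    """Hypothetical matcher: a term matches if it starts with the query."""

    def find_matches(self, query: str) -> List[Tuple[str, float]]:
        q = query.lower()
        hits = [
            (term, len(q) / len(term))
            for term in self.terms
            if q and term.lower().startswith(q)
        ]
        hits.sort(key=lambda pair: pair[1], reverse=True)
        return hits


matcher = PrefixMatcher(["Brooklyn", "Bronx", "Queens"])
print(matcher.find_matches("br"))    # [('Bronx', 0.4), ('Brooklyn', 0.25)]
print(matcher.matched_values("br"))  # ['Bronx', 'Brooklyn']
print(matcher.matched_values("x"))   # []
```

A concrete subclass only has to supply find_matches; matched_values follows from it.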

class openclean.function.matching.base.StringSimilarity

Bases: object

Abstract base class for functions that compute similarity scores between a list of terms and a query string and return a list of StringMatch results. String similarity scores should be values in the interval [0, 1], where 0 indicates no match and 1 indicates an exact match.

abstract match(vocabulary: Iterable[str], query: str) → List[openclean.data.mapping.StringMatch]

Compute a similarity score for a string against items from a vocabulary iterable. A score of 1 indicates an exact match. A score of 0 indicates no match.

Parameters
  • vocabulary (Iterable[str]) – List of strings to compare with.

  • query (string) – Query term that is compared against the vocabulary strings.

Return type

list of openclean.data.mapping.StringMatch

score(vocabulary: Iterable[str], query: str) → List[openclean.data.mapping.StringMatch]

Synonym for the match function. Compute a similarity score for a string against items from a vocabulary iterable. A score of 1 indicates an exact match. A score of 0 indicates no match.

Parameters
  • vocabulary (Iterable[str]) – List of strings to compare with.

  • query (string) – Query term that is compared against the vocabulary strings.

Return type

list of openclean.data.mapping.StringMatch
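A similarity that satisfies the [0, 1] contract, with score acting as a synonym for match, might look like the sketch below. It does not use the library's classes: SimilaritySketch and RatioSimilarity are hypothetical names, results are plain (term, score) tuples, and the score comes from the standard library's difflib.SequenceMatcher.ratio(), which was picked for the example and is not something the source prescribes.

```python
from abc import ABC, abstractmethod
from difflib import SequenceMatcher
from typing import Iterable, List, Tuple


class SimilaritySketch(ABC):
    """Stand-in for an abstract similarity function."""

    @abstractmethod
    def match(self, vocabulary: Iterable[str], query: str) -> List[Tuple[str, float]]:
        """Return one (term, score) pair per vocabulary term."""

    def score(self, vocabulary: Iterable[str], query: str) -> List[Tuple[str, float]]:
        # Synonym: delegate to match.
        return self.match(vocabulary, query)


class RatioSimilarity(SimilaritySketch):
    """Hypothetical similarity based on difflib's ratio, which lies in [0, 1]."""

    def match(self, vocabulary: Iterable[str], query: str) -> List[Tuple[str, float]]:
        return [
            (term, SequenceMatcher(None, term, query).ratio())
            for term in vocabulary
        ]


sim = RatioSimilarity()
result = sim.score(["abc", "xyz"], "abc")
print(result)  # [('abc', 1.0), ('xyz', 0.0)]
```

Identical strings score 1.0 and strings with no characters in common score 0.0, which lines up with the exact-match and no-match ends of the interval.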

openclean.function.matching.base.best_matches(values: Iterable[str], matcher: openclean.function.matching.base.StringMatcher, include_vocab: Optional[bool] = False) → openclean.data.mapping.Mapping

Generate a mapping of best matches for a list of values. For each value in the given list the best matches with a given vocabulary are computed and added to the returned mapping.

If the include_vocab flag is False, the resulting mapping will contain entries only for those values in the input list that do not already occur in the vocabulary, i.e., the values that are unknown with respect to the known vocabulary.

Parameters
  • values (iterable of strings) – List of terms (e.g., from a data frame column) for which matches are computed for the returned mapping.

  • matcher (openclean.function.matching.base.StringMatcher) – Matcher to compute matches for the terms in a controlled vocabulary.

  • include_vocab (bool, default=False) – If this flag is False the resulting mapping will only contain matches for terms that are not in the vocabulary that is associated with the given matcher.

Return type

openclean.data.mapping.Mapping
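The effect of the include_vocab flag can be illustrated with a simplified stand-in. This sketch makes several assumptions that the source does not confirm: the helper best_matches_sketch is hypothetical, it takes the vocabulary and a find_matches callable as separate arguments instead of a StringMatcher, it returns a plain dict rather than a Mapping object, and it keeps an empty list for values that have no match.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Match = Tuple[str, float]  # stand-in for a StringMatch (term, score) pair


def best_matches_sketch(
    values: Iterable[str],
    vocabulary: Iterable[str],
    find_matches: Callable[[str], List[Match]],
    include_vocab: bool = False,
) -> Dict[str, List[Match]]:
    """Map each value to its matches, optionally skipping known terms."""
    known = set(vocabulary)
    mapping: Dict[str, List[Match]] = {}
    for value in values:
        if value in known and not include_vocab:
            # Value already occurs in the vocabulary; nothing to resolve.
            continue
        if value not in mapping:
            mapping[value] = find_matches(value)
    return mapping


vocabulary = ["Brooklyn", "Queens"]


def case_insensitive(query: str) -> List[Match]:
    # Hypothetical matcher: exact match ignoring case.
    return [(term, 1.0) for term in vocabulary if term.lower() == query.lower()]


values = ["brooklyn", "Queens", "Bronks"]
unknown_only = best_matches_sketch(values, vocabulary, case_insensitive)
with_vocab = best_matches_sketch(values, vocabulary, case_insensitive, include_vocab=True)
print(unknown_only)
# {'brooklyn': [('Brooklyn', 1.0)], 'Bronks': []}
print(with_vocab)
# {'brooklyn': [('Brooklyn', 1.0)], 'Queens': [('Queens', 1.0)], 'Bronks': []}
```

With the flag at its default of False, "Queens" is skipped because it is already a vocabulary term; only the unknown values are mapped.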