openclean.operator.stream.matching module

class openclean.operator.stream.matching.BestMatches(matcher: openclean.function.matching.base.StringMatcher, include_vocab: Optional[bool] = False, mapping: Optional[openclean.data.mapping.Mapping] = None)

Bases: openclean.operator.stream.consumer.StreamConsumer, openclean.operator.stream.processor.StreamProcessor

close() → openclean.data.mapping.Mapping

Return the collected mapping at the end of the stream.

Return type

openclean.data.mapping.Mapping

consume(rowid: int, row: List[Union[int, float, str, datetime.datetime]]) → List[Union[int, float, str, datetime.datetime]]

Consume the given row. Assumes that the row contains exactly one column.

Parameters
  • rowid (int) – Identifier of the row in the data stream.

  • row (list of scalar values) – Row from the data stream. Must contain exactly one column, whose value is the term for which matches are computed for the collected mapping.

Matches are computed with the matcher (openclean.function.matching.base.StringMatcher) that was passed to the class constructor. If the constructor's include_vocab flag is False (the default), the resulting mapping will only contain matches for terms that are not in the vocabulary that is associated with the given matcher.

Return type

List[Union[int, float, str, datetime.datetime]]
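
The effect of the include_vocab flag described above can be illustrated with a small standalone sketch. This is not the openclean implementation: best_matches_sketch is a hypothetical helper, the mapping is a plain dictionary rather than an openclean.data.mapping.Mapping, and difflib.get_close_matches from the Python standard library stands in for the matcher. It also assumes that vocabulary terms, when included, are matched like any other term.

```python
from difflib import get_close_matches
from typing import Dict, Iterable, List


def best_matches_sketch(
    values: Iterable[str],
    vocabulary: List[str],
    include_vocab: bool = False,
) -> Dict[str, List[str]]:
    """Hypothetical stand-in for illustration; not part of openclean."""
    vocab = set(vocabulary)
    mapping: Dict[str, List[str]] = {}
    for term in values:
        # Each distinct term is matched only once.
        if term in mapping:
            continue
        # With include_vocab=False, terms already in the vocabulary are skipped.
        if not include_vocab and term in vocab:
            continue
        # Assumption: keep only the single closest vocabulary entry.
        mapping[term] = get_close_matches(term, vocabulary, n=1, cutoff=0.6)
    return mapping


terms = ["Brooklyn", "Brooklin", "Queens"]
boroughs = ["Brooklyn", "Queens", "Bronx"]

without_vocab = best_matches_sketch(terms, boroughs)
with_vocab = best_matches_sketch(terms, boroughs, include_vocab=True)
```

In this sketch, without_vocab holds an entry only for the misspelled term "Brooklin", while with_vocab also holds entries for "Brooklyn" and "Queens".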

open(schema: List[Union[str, histore.document.schema.Column]]) openclean.operator.stream.consumer.StreamConsumer

Factory method for the stream consumer. Returns an instance of the best matches consumer for the given schema.

Raises a ValueError if the given schema contains more than one column.

Parameters

schema (list of string or histore.document.schema.Column) – List of column names in the data stream schema.

Return type

openclean.operator.stream.consumer.StreamConsumer

Raises

ValueError
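
The open, consume, close life cycle documented on this page can be sketched as follows. This is a minimal illustration under stated assumptions, not the openclean code: SingleColumnMatchConsumer is a hypothetical class, the match callable replaces a real StringMatcher, the collected mapping is a plain dictionary, and returning the row unchanged from consume is an assumption.

```python
from typing import Callable, Dict, List


class SingleColumnMatchConsumer:
    """Hypothetical stand-in that mimics the open/consume/close protocol."""

    def __init__(self, match: Callable[[str], List[str]]):
        # Assumption: the matcher is any callable from a term to its matches.
        self.match = match
        self.mapping: Dict[str, List[str]] = {}

    def open(self, schema: List[str]) -> "SingleColumnMatchConsumer":
        # Reject schemas that do not have exactly one column.
        if len(schema) != 1:
            raise ValueError(
                "expected exactly one column, got {}".format(len(schema))
            )
        return self

    def consume(self, rowid: int, row: List[str]) -> List[str]:
        # The single column value is the term to match.
        term = row[0]
        if term not in self.mapping:
            self.mapping[term] = self.match(term)
        # Assumption: the row is passed through unchanged.
        return row

    def close(self) -> Dict[str, List[str]]:
        # Return the mapping collected over the whole stream.
        return self.mapping


consumer = SingleColumnMatchConsumer(match=lambda t: [t.lower()]).open(["city"])
for rowid, row in enumerate([["NYC"], ["Boston"], ["NYC"]]):
    consumer.consume(rowid, row)
result = consumer.close()
```

Here result contains one entry per distinct term in the stream, and calling open with a two-column schema such as ["city", "state"] raises ValueError.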