openclean.operator.stream.sample module

Implementation of stream operators that collect a random sampe of rows from a data stream.

class openclean.operator.stream.sample.Sample(n: int, random_state: Optional[int] = None)

Bases: openclean.operator.stream.processor.StreamProcessor

Processor that defines a random sampling operator in a data stream.

open(schema: List[Union[str, histore.document.schema.Column]]) openclean.operator.stream.consumer.StreamConsumer

Factory pattern for stream consumer. Returns an instance of the random sample generator in a data pipeline.

Parameters

schema (list of string) – List of column names in the data stream schema.

Return type

openclean.operator.stream.consumer.StreamConsumer

class openclean.operator.stream.sample.SampleCollector(columns: List[Union[str, histore.document.schema.Column]], n: int, random_state: Optional[int] = None, consumer: Optional[openclean.operator.stream.consumer.StreamConsumer] = None)

Bases: openclean.operator.stream.consumer.ProducingConsumer

Collect a random sample of rows from the data stream and pass them to a downstream consumer. Implements random sampling without replacement using the Reservoir Sampling algorithm. See:

Alon N., Matias Y., and Szegedy M. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci., 58(1):137–147, 1999.

and

Lahiri B., Tirthapura S. (2009) Stream Sampling. In: LIU L., ÖZSU M.T. (eds) Encyclopedia of Database Systems. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-39940-9_372

for details.

Maintains a row buffer with n rows. Pushes the final row set to the downstream consumer at the end of the stream.

close() Any

Pass the selected sample to the connected downstream consumer. Returns the consumer result.

Collect a modified list of rows. Returns the result of the downstream consumer or the collected results (if the consumer result is None).

Return type

any

consume(rowid: int, row: List[Union[int, float, str, datetime.datetime]])

Randomly add the given (rowid, row)-pair to the internal buffer.

Parameters
  • rowid (int) – Unique row identifier

  • row (list) – List of values in the row.

handle(rowid: int, row: List[Union[int, float, str, datetime.datetime]])

Add the given row to the buffer if the maximum buffer size has not been reached yet or the row is randomly selected for inclusion in the sample.

Parameters
  • rowid (int) – Unique row identifier

  • row (list) – List of values in the row.

Return type

list