openclean.operator.stream.sample module
Implementation of stream operators that collect a random sample of rows from a data stream.
- class openclean.operator.stream.sample.Sample(n: int, random_state: Optional[int] = None)
Bases: openclean.operator.stream.processor.StreamProcessor
Processor that defines a random sampling operator in a data stream.
- open(schema: List[Union[str, histore.document.schema.Column]]) openclean.operator.stream.consumer.StreamConsumer
Factory method for the stream consumer. Returns an instance of the random sample generator for use in a data pipeline.
- Parameters
schema (list of string) – List of column names in the data stream schema.
- Return type
openclean.operator.stream.consumer.StreamConsumer
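The factory pattern described for open() can be sketched roughly as follows. This is a minimal illustration, not the openclean implementation: `SampleSketch` and `CollectorSketch` are hypothetical stand-ins, and the assumption is that the processor only stores configuration while each call to open() creates a fresh consumer bound to one stream schema.

```python
from typing import Any, List, Optional


class CollectorSketch:
    """Hypothetical consumer: holds per-stream state (schema and row count)."""

    def __init__(self, columns: List[str], n: int, random_state: Optional[int] = None):
        self.columns = list(columns)
        self.n = n
        self.random_state = random_state
        self.rows_seen = 0

    def consume(self, rowid: int, row: List[Any]) -> None:
        # A real collector would decide here whether to keep the row.
        self.rows_seen += 1

    def close(self) -> int:
        # This stub only reports how many rows it has seen.
        return self.rows_seen


class SampleSketch:
    """Hypothetical processor: holds configuration only, no stream state."""

    def __init__(self, n: int, random_state: Optional[int] = None):
        self.n = n
        self.random_state = random_state

    def open(self, schema: List[str]) -> CollectorSketch:
        # Factory method: each call returns a fresh consumer for one stream.
        return CollectorSketch(columns=schema, n=self.n, random_state=self.random_state)


processor = SampleSketch(n=10, random_state=42)
first = processor.open(["id", "name"])
second = processor.open(["id", "name"])
first.consume(0, [1, "alice"])
```

Under this assumption the same processor object can be reused across pipelines, because state such as the row counter lives in the consumer returned by open() and not in the processor itself.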
- class openclean.operator.stream.sample.SampleCollector(columns: List[Union[str, histore.document.schema.Column]], n: int, random_state: Optional[int] = None, consumer: Optional[openclean.operator.stream.consumer.StreamConsumer] = None)
Bases: openclean.operator.stream.consumer.ProducingConsumer
Collect a random sample of rows from the data stream and pass them to a downstream consumer. Implements random sampling without replacement using the Reservoir Sampling algorithm. See:
Alon N., Matias Y., and Szegedy M. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci., 58(1):137–147, 1999.
and
Lahiri B., Tirthapura S. (2009) Stream Sampling. In: LIU L., ÖZSU M.T. (eds) Encyclopedia of Database Systems. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-39940-9_372
for details.
Maintains a row buffer of at most n rows. Pushes the final row set to the downstream consumer at the end of the stream.
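The buffering behavior described above can be illustrated with a short sketch of basic reservoir sampling. This is an assumption-laden illustration and not the openclean source: the class name `ReservoirSketch`, the use of Python's `random.Random`, and returning the buffer from close() (instead of forwarding it to a downstream consumer) are all choices made for the example.

```python
import random
from typing import Any, List, Optional, Tuple


class ReservoirSketch:
    """Hypothetical stand-in for a sample collector (not openclean code)."""

    def __init__(self, n: int, random_state: Optional[int] = None):
        self.n = n
        self.rand = random.Random(random_state)
        self.buffer: List[Tuple[int, List[Any]]] = []
        self.seen = 0  # number of rows consumed so far

    def consume(self, rowid: int, row: List[Any]) -> None:
        self.seen += 1
        if len(self.buffer) < self.n:
            # Fill phase: the first n rows are always kept.
            self.buffer.append((rowid, row))
        else:
            # Replacement phase: draw a position in [0, seen). The new row
            # replaces a buffered row only if the position falls inside the
            # buffer, which happens with probability n / seen.
            j = self.rand.randrange(self.seen)
            if j < self.n:
                self.buffer[j] = (rowid, row)

    def close(self) -> List[Tuple[int, List[Any]]]:
        # End of stream: hand back the sampled (rowid, row) pairs.
        return list(self.buffer)


sampler = ReservoirSketch(n=3, random_state=42)
for rowid in range(100):
    sampler.consume(rowid, [rowid, f"value-{rowid}"])
sample = sampler.close()
```

Because a buffered row is only ever replaced, never duplicated, the result is a sample without replacement, and memory use stays bounded by n regardless of the stream length.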
- close() Any
Pass the selected sample to the connected downstream consumer. Returns the result of the downstream consumer, or the collected rows if the consumer result is None.
- Return type
any
- consume(rowid: int, row: List[Union[int, float, str, datetime.datetime]])
Randomly add the given (rowid, row)-pair to the internal buffer.
- Parameters
rowid (int) – Unique row identifier.
row (list) – List of values in the row.
- handle(rowid: int, row: List[Union[int, float, str, datetime.datetime]])
Add the given row to the buffer if the maximum buffer size has not been reached yet, or if the row is randomly selected for inclusion in the sample.
- Parameters
rowid (int) – Unique row identifier.
row (list) – List of values in the row.
- Return type
list