openclean.operator.transform.limit module
Data frame transformer (and stream processor) that can be used to limit the number of rows in a data frame. The limit operator is primarily intended for use in streaming settings. The data frame transformer implementation is included for completeness.
- class openclean.operator.transform.limit.Limit(rows: int)
Bases: openclean.operator.stream.processor.StreamProcessor, openclean.operator.base.DataFrameTransformer
Transformer and stream processor that limits the number of rows in a data frame.
- open(schema: List[Union[str, histore.document.schema.Column]]) openclean.operator.stream.consumer.StreamConsumer
Factory pattern for stream consumer. Returns an instance of a stream consumer that limits the number of rows that are passed on to a downstream consumer.
- Parameters
schema (list of string) – List of column names in the data stream schema.
- Return type
openclean.operator.transform.limit.LimitConsumer
- transform(df: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame
Return a data frame that contains at most n rows (where n is the row limit that was set when this object was created).
- Parameters
df (pd.DataFrame) – Input data frame.
- Return type
pd.DataFrame
- class openclean.operator.transform.limit.LimitConsumer(columns: List[Union[str, histore.document.schema.Column]], limit: int, consumer: Optional[openclean.operator.stream.consumer.StreamConsumer] = None)
Bases: openclean.operator.stream.consumer.ProducingConsumer
Consumer that limits the number of rows that are passed on to a downstream consumer. Raises a StopIteration error when the maximum number of rows is reached.
- handle(rowid: int, row: List[Union[int, float, str, datetime.datetime]]) List[Union[int, float, str, datetime.datetime]]
Pass the row on to the downstream consumer if the row limit has not been reached yet. Otherwise, a StopIteration error is raised.
- Parameters
rowid (int) – Unique row identifier
row (list) – List of values in the row.
- Return type
list
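The row-limiting behavior described for LimitConsumer can be illustrated with a minimal, library-independent sketch. It does not use or reproduce the openclean implementation: the names LimitSketch, CollectingConsumer, and run_stream are hypothetical, and the sketch assumes that the component driving the stream treats StopIteration as a signal to stop reading.

```python
from typing import Any, Iterable, List, Optional


class CollectingConsumer:
    """Hypothetical downstream consumer that stores every row it receives."""

    def __init__(self) -> None:
        self.rows: List[List[Any]] = []

    def handle(self, rowid: int, row: List[Any]) -> List[Any]:
        self.rows.append(row)
        return row


class LimitSketch:
    """Illustrative stand-in for a row-limiting consumer (not openclean code)."""

    def __init__(self, limit: int, consumer: Optional[CollectingConsumer] = None) -> None:
        self.limit = limit
        self.consumer = consumer
        self.count = 0  # number of rows passed downstream so far

    def handle(self, rowid: int, row: List[Any]) -> List[Any]:
        # Signal the end of the stream once the row limit has been reached.
        if self.count >= self.limit:
            raise StopIteration()
        self.count += 1
        if self.consumer is not None:
            return self.consumer.handle(rowid, row)
        return row


def run_stream(rows: Iterable[List[Any]], consumer: LimitSketch) -> None:
    """Hypothetical driver: feed rows until exhausted or StopIteration is raised."""
    for rowid, row in enumerate(rows):
        try:
            consumer.handle(rowid, row)
        except StopIteration:
            break


sink = CollectingConsumer()
run_stream([["a", 1], ["b", 2], ["c", 3]], LimitSketch(limit=2, consumer=sink))
print(sink.rows)  # [['a', 1], ['b', 2]]
```

Raising an exception rather than silently dropping rows lets the reader stop early, so the rest of a large input does not have to be scanned once the limit is reached.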
- openclean.operator.transform.limit.limit(df: pandas.core.frame.DataFrame, rows: int) pandas.core.frame.DataFrame
Limit the number of rows in a data frame. Returns a data frame that contains at most the first n (n=rows) rows from the input data frame.
- Parameters
df (pd.DataFrame) – Input data frame.
rows (int) – Limit on the number of rows in the result. Rows are included starting from the first row until either the row limit or the end of the data frame is reached (whichever comes first).
- Return type
pd.DataFrame
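The documented result (at most the first n rows of the input) can be approximated with plain pandas. The sketch below is an assumption about equivalent behavior for non-negative limits, not the library's implementation; limit_sketch is a hypothetical name.

```python
import pandas as pd


def limit_sketch(df: pd.DataFrame, rows: int) -> pd.DataFrame:
    """Return at most the first `rows` rows of `df` (illustrative only)."""
    # Assumption: DataFrame.head matches the documented semantics for rows >= 0.
    return df.head(rows)


df = pd.DataFrame({"name": ["a", "b", "c"], "value": [1, 2, 3]})

first_two = limit_sketch(df, 2)    # limit reached before the end of the frame
everything = limit_sketch(df, 10)  # end of the frame reached before the limit

print(first_two.shape)   # (2, 2)
print(everything.shape)  # (3, 2)
```

If openclean is installed, the signatures documented above suggest the corresponding calls would be limit(df, rows=2) or Limit(rows=2).transform(df).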