openclean.operator.transform.limit module

Data frame transformer (and stream processor) that can be used to limit the number of rows in a data frame. The limit operator is primarily intended for use in streaming settings. The data frame transformer implementation is included for completeness.

class openclean.operator.transform.limit.Limit(rows: int)

Bases: openclean.operator.stream.processor.StreamProcessor, openclean.operator.base.DataFrameTransformer

Transformer and stream processor that limits the number of rows in a data frame.

open(schema: List[Union[str, histore.document.schema.Column]]) openclean.operator.stream.consumer.StreamConsumer

Factory pattern for stream consumer. Returns an instance of a stream consumer that limits the number of rows that are passed on to a downstream consumer.

Parameters

schema (list of string or histore.document.schema.Column) – List of columns in the data stream schema.

Return type

openclean.operator.transform.limit.LimitConsumer

transform(df: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame

Return a data frame that contains at most n rows (where n equals the row limit that was set when this object was created).

Parameters

df (pd.DataFrame) – Input data frame.

Return type

pd.DataFrame

class openclean.operator.transform.limit.LimitConsumer(columns: List[Union[str, histore.document.schema.Column]], limit: int, consumer: Optional[openclean.operator.stream.consumer.StreamConsumer] = None)

Bases: openclean.operator.stream.consumer.ProducingConsumer

Consumer that limits the number of rows that are passed on to a downstream consumer. Raises a StopIteration error when the maximum number of rows is reached.

handle(rowid: int, row: List[Union[int, float, str, datetime.datetime]]) List[Union[int, float, str, datetime.datetime]]

Pass the row on to the downstream consumer if the row limit has not been reached yet. Otherwise, a StopIteration error is raised.

Parameters
  • rowid (int) – Unique row identifier

  • row (list) – List of values in the row.

Return type

list
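The stop-on-limit behavior described for handle can be illustrated with a minimal, library-independent sketch. SketchLimitConsumer and run_stream below are hypothetical names introduced for illustration only; they are not part of openclean, and a plain list stands in for the downstream consumer. The sketch assumes that the producer catches StopIteration and stops reading further rows.

```python
from typing import Any, Iterable, List, Optional, Tuple


class SketchLimitConsumer:
    """Hypothetical stand-in for LimitConsumer; not the openclean class."""

    def __init__(self, limit: int, downstream: Optional[list] = None):
        self.limit = limit
        self.count = 0
        # A plain list plays the role of the downstream consumer here.
        self.downstream = downstream if downstream is not None else []

    def handle(self, rowid: int, row: List[Any]) -> List[Any]:
        # Signal the producer to stop once the row limit has been reached.
        if self.count >= self.limit:
            raise StopIteration()
        self.count += 1
        self.downstream.append((rowid, row))
        return row


def run_stream(
    rows: Iterable[List[Any]], consumer: SketchLimitConsumer
) -> List[Tuple[int, List[Any]]]:
    """Feed rows to the consumer until it signals that it is done."""
    for rowid, row in enumerate(rows):
        try:
            consumer.handle(rowid, row)
        except StopIteration:
            # The consumer has seen enough rows; stop reading the stream.
            break
    return consumer.downstream


collected = run_stream(
    [["a", 1], ["b", 2], ["c", 3]],
    SketchLimitConsumer(limit=2),
)
```

In this sketch the third row is never passed downstream: the consumer raises StopIteration on the first row that arrives after the limit is reached, and the loop ends early instead of scanning the rest of the stream.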

openclean.operator.transform.limit.limit(df: pandas.core.frame.DataFrame, rows: int) pandas.core.frame.DataFrame

Limit the number of rows in a data frame. Returns a data frame that contains at most the first n (n=rows) rows from the input data frame.

Parameters
  • df (pd.DataFrame) – Input data frame.

  • rows (int) – Limit on number of rows in the result. Rows are included starting from the first row until either the row limit or end of the data frame is reached (whichever comes first).

Return type

pd.DataFrame
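The "at most the first n rows" semantics can be sketched on a plain list of rows using only the standard library. limit_rows is a hypothetical helper written for illustration; it does not use openclean or pandas, and it makes no claim about how the library implements the function. For a pandas data frame, DataFrame.head(n) has the same behavior for non-negative n.

```python
from itertools import islice
from typing import Any, Iterable, List


def limit_rows(rows: Iterable[List[Any]], n: int) -> List[List[Any]]:
    """Hypothetical helper: keep at most the first n rows."""
    # islice stops after n rows or at the end of the input,
    # whichever comes first.
    return list(islice(rows, n))


data = [["alice", 30], ["bob", 25], ["carol", 41]]

# The limit is smaller than the input: only the first two rows remain.
first_two = limit_rows(data, 2)

# The limit exceeds the input size: all rows are returned unchanged.
all_rows = limit_rows(data, 10)
```

A limit larger than the number of available rows is not an error in this sketch; the result simply contains every row.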