openclean.operator.transform.filter module

Functions and classes that implement the filter operators in openclean.

class openclean.operator.transform.filter.Filter(predicate: openclean.function.eval.base.EvalFunction, negated: Optional[bool] = False)

Bases: openclean.operator.stream.processor.StreamProcessor, openclean.operator.base.DataFrameTransformer

Data frame transformer that evaluates a Boolean predicate on the rows of a data frame. The transformed output contains only those rows for which the predicate evaluated to True (or False if the negated flag is True).

open(schema: List[Union[str, histore.document.schema.Column]]) openclean.operator.stream.consumer.StreamFunctionHandler

Factory pattern for stream consumer. Returns an instance of a stream consumer that filters rows in a data stream using a stream function representing the filter predicate.

Parameters

schema (list of string) – List of column names in the data stream schema.

Return type

openclean.operator.stream.consumer.StreamFunctionHandler
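The factory idea can be sketched in plain Python: bind the schema and predicate once, then filter rows as they arrive. This is an illustration only and not openclean's implementation. The helper make_stream_filter, the (row id, values) row layout, and the dict-based predicate are all assumptions made for this example.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Assumed row layout for this sketch: (row identifier, list of cell values).
Row = Tuple[int, List]


def make_stream_filter(
    schema: List[str],
    predicate: Callable[[Dict], bool],
    negated: bool = False,
) -> Callable[[Iterable[Row]], Iterator[Row]]:
    """Hypothetical stand-in for a consumer factory.

    The schema is bound once; the returned function then filters rows
    lazily, one at a time, without materializing the whole stream.
    """
    def consume(rows: Iterable[Row]) -> Iterator[Row]:
        for rowid, values in rows:
            # Map column names to values so the predicate can refer to columns.
            record = dict(zip(schema, values))
            # Keep the row if the predicate result differs from the negated flag.
            if bool(predicate(record)) != negated:
                yield rowid, values
    return consume


stream = [(0, ["Alice", 23]), (1, ["Bob", 17]), (2, ["Claire", 41])]
adults = make_stream_filter(["name", "age"], lambda r: r["age"] >= 18)
minors = make_stream_filter(["name", "age"], lambda r: r["age"] >= 18, negated=True)

kept = list(adults(stream))
dropped = list(minors(stream))
```

Binding the schema up front means the column lookup is prepared once per stream rather than once per row, which is the usual motivation for a factory of this kind.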

transform(df: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame

Return a data frame that contains only those rows from the given input data frame that satisfy the filter condition.

Parameters

df (pd.DataFrame) – Input data frame.

Return type

pd.DataFrame

openclean.operator.transform.filter.delete(df: pandas.core.frame.DataFrame, predicate: openclean.function.eval.base.EvalFunction) pandas.core.frame.DataFrame

Delete rows in a data frame. The delete operator evaluates a given predicate on all rows in a data frame. It returns a new data frame where those rows that satisfied the predicate are deleted.

Parameters
  • df (pd.DataFrame) – Input data frame.

  • predicate (openclean.function.eval.base.EvalFunction) – Evaluation function that is expected to return a Boolean value when evaluated on a data frame row. All rows in the input data frame that satisfy the predicate will be deleted.

Return type

pd.DataFrame
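The delete semantics described above can be approximated with pandas alone. The sketch below is not the library's code: delete_rows is a hypothetical helper, and it takes a plain Python callable over a row instead of an EvalFunction.

```python
import pandas as pd


def delete_rows(df: pd.DataFrame, predicate) -> pd.DataFrame:
    """Hypothetical pandas-only stand-in for the documented delete semantics."""
    # Evaluate the predicate on every row to obtain a Boolean mask.
    mask = df.apply(predicate, axis=1).astype(bool)
    # Rows that satisfy the predicate are removed; the input is not modified.
    return df[~mask]


df = pd.DataFrame({"name": ["Alice", "Bob", "Claire"], "age": [23, 17, 41]})
result = delete_rows(df, lambda row: row["age"] < 18)
```

Here result holds the rows for Alice and Claire, while df still contains all three rows.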

openclean.operator.transform.filter.filter(df: pandas.core.frame.DataFrame, predicate: openclean.function.eval.base.EvalFunction, negated: Optional[bool] = False) pandas.core.frame.DataFrame

Filter function for data frames. Returns a data frame that contains only the rows of the input data frame for which the given predicate evaluates to True (or to False if negated is True).

Parameters
  • df (pd.DataFrame) – Input data frame.

  • predicate (openclean.function.eval.base.EvalFunction) – Evaluation function that is expected to return a Boolean value when evaluated on a data frame row. Only those rows in the input data frame that satisfy the predicate will be included in the result.

  • negated (bool, default=False) – Negate the predicate value to get an inverted result.

Return type

pd.DataFrame

Raises

ValueError
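The effect of the negated flag can be shown with a small pandas-only sketch. It mirrors the documented behavior but is not openclean's implementation: filter_rows is a hypothetical helper, and the predicate is an ordinary callable over a row instead of an EvalFunction.

```python
import pandas as pd


def filter_rows(df: pd.DataFrame, predicate, negated: bool = False) -> pd.DataFrame:
    """Hypothetical pandas-only stand-in for the documented filter semantics."""
    # Evaluate the predicate on every row to obtain a Boolean mask.
    mask = df.apply(predicate, axis=1).astype(bool)
    # With negated=True the mask is inverted, so only non-matching rows remain.
    return df[~mask] if negated else df[mask]


df = pd.DataFrame({"city": ["Oslo", "Lima", "Pune"], "temp": [4, 19, 31]})


def is_warm(row) -> bool:
    return row["temp"] > 15


warm = filter_rows(df, is_warm)
cold = filter_rows(df, is_warm, negated=True)
```

Read against the descriptions in this module, a negated filter keeps exactly the rows that delete would keep for the same predicate; whether the library implements one in terms of the other is not stated here.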