openclean.function.eval.domain module

Predicates that test for containment of column values in value sets.

class openclean.function.eval.domain.IsIn(columns, domain, ignore_case=False)

Bases: openclean.function.eval.base.Eval

Boolean predicate that tests whether a value (or list of values) belongs to a domain of known values.

class openclean.function.eval.domain.IsNotIn(columns, domain, ignore_case=False)

Bases: openclean.function.eval.base.Eval

Boolean predicate that tests whether a value (or list of values) does not belong to a domain of known values.
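The containment semantics of the two predicates can be illustrated without the library. The following is a minimal plain-Python sketch, not the openclean implementation: the helper names `is_in`, `is_not_in`, and `_normalize` are hypothetical, and it assumes that `ignore_case` compares strings in lower case and that a "list of values" from multiple columns is tested as a tuple.

```python
from typing import Any, Iterable


def _normalize(value: Any, ignore_case: bool) -> Any:
    """Lowercase strings (also inside tuples) when ignore_case is set.

    Assumption: this is how case-insensitive matching behaves; the
    documented signature only names the flag.
    """
    if not ignore_case:
        return value
    if isinstance(value, str):
        return value.lower()
    if isinstance(value, tuple):
        return tuple(_normalize(v, ignore_case) for v in value)
    return value


def is_in(value: Any, domain: Iterable[Any], ignore_case: bool = False) -> bool:
    """Return True if the (normalized) value is a member of the domain."""
    normalized_domain = {_normalize(v, ignore_case) for v in domain}
    return _normalize(value, ignore_case) in normalized_domain


def is_not_in(value: Any, domain: Iterable[Any], ignore_case: bool = False) -> bool:
    """Negation of is_in."""
    return not is_in(value, domain, ignore_case)


boroughs = {"Brooklyn", "Queens"}
results = [
    is_in("Brooklyn", boroughs),                    # exact match
    is_in("BROOKLYN", boroughs),                    # case differs
    is_in("BROOKLYN", boroughs, ignore_case=True),  # case ignored
    is_not_in("Boston", boroughs),                  # not in the domain
]
```

In this sketch a value from several columns would be passed as a tuple, for example `is_in(("a", "B"), [("A", "b")], ignore_case=True)`.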

class openclean.function.eval.domain.Lookup(columns: Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction, List[Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction]]], mapping: Dict, default: Optional[Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction, List[Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction]]]] = None)

Bases: openclean.function.eval.base.EvalFunction

A Lookup table is a mapping function. For a given lookup value, the result is the mapped value from the given dictionary if a mapping exists. Otherwise, the returned value is generated by a default value function. If no default value function is defined, the input value is returned as the result.

The aim of having default as an evaluation function is to enable lookups of values in one column using an incomplete lookup table while updating the values in a separate column (other than the lookup column). In this case, the lookup value is not the default value.
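The three-way fallback described above (mapped value, then default function, then the input value) can be sketched in plain Python. This is an illustration of the described semantics only, not the library's code: `lookup_row` is a hypothetical helper, rows are modeled as dictionaries, and the default is modeled as a callable that receives the whole row so that it can read a column other than the lookup column.

```python
from typing import Any, Callable, Dict, Optional

Row = Dict[str, Any]


def lookup_row(
    row: Row,
    column: str,
    mapping: Dict[Any, Any],
    default: Optional[Callable[[Row], Any]] = None,
) -> Any:
    """Resolve the value in `column` against `mapping` with fallbacks."""
    key = row[column]
    if key in mapping:
        # A mapping exists: return the mapped value.
        return mapping[key]
    if default is not None:
        # No mapping: generate the result from the default function,
        # which may read a different column of the same row.
        return default(row)
    # No mapping and no default: return the input value unchanged.
    return key


state_names = {"NY": "New York", "NJ": "New Jersey"}
rows = [
    {"code": "NY", "label": "n/a"},
    {"code": "CA", "label": "California (unverified)"},
]

# Incomplete lookup table with a default taken from another column.
with_default = [
    lookup_row(r, "code", state_names, default=lambda r: r["label"])
    for r in rows
]

# Same lookup without a default: unmapped values pass through.
without_default = [lookup_row(r, "code", state_names) for r in rows]
```

Here `with_default` falls back to the `label` column for the unmapped code, while `without_default` leaves the unmapped code as it was.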

eval(df: pandas.core.frame.DataFrame) → Union[pandas.core.series.Series, List[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]]]

Evaluate the consumer on the lists of values that are generated by the referenced columns.

Parameters

df (pd.DataFrame) – Pandas data frame.

Return type

pd.Series or list

prepare(columns: List[Union[str, histore.document.schema.Column]]) → Callable[[List[Union[int, float, str, datetime.datetime]]], Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]]

Prepare the evaluation function to be able to process rows in a data stream. This method is called before streaming starts to inform the function about the schema of the rows in the data stream.

Prepare is expected to return a callable that accepts a single data stream row as input and that returns a single value (if the function operates on a single column) or a tuple of values (for functions that operate over multiple columns).

Parameters

columns (list of string) – List of column names in the schema of the data stream.

Return type

openclean.data.stream.base.StreamFunction
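The prepare pattern described above, resolving column names against the stream schema once and then returning a per-row callable, might look roughly like the following. This is a simplified sketch under stated assumptions rather than the library's implementation: `prepare_lookup` is a hypothetical stand-alone function, rows are plain lists, and only the single-column case with no default function is shown.

```python
from typing import Any, Callable, Dict, List


def prepare_lookup(
    schema: List[str],
    column: str,
    mapping: Dict[Any, Any],
) -> Callable[[List[Any]], Any]:
    """Bind a lookup to a stream schema and return a per-row callable."""
    # Resolve the column position once, before streaming starts.
    index = schema.index(column)

    def stream_function(row: List[Any]) -> Any:
        # Called for every row in the stream; returns a single value
        # because the lookup operates on a single column.
        value = row[index]
        return mapping.get(value, value)

    return stream_function


schema = ["id", "state"]
fn = prepare_lookup(schema, "state", {"NY": "New York"})

stream = [[1, "NY"], [2, "CA"]]
processed = [fn(row) for row in stream]
```

Resolving the index up front keeps the per-row work to a list access and a dictionary lookup, which is the point of preparing the function before the stream is consumed.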