openclean.operator.collector.count module

Functions to compute the set of distinct values and their frequencies from a list of values.

openclean.operator.collector.count.count(df: pandas.core.frame.DataFrame, predicate: Optional[openclean.function.eval.base.EvalFunction] = None, truth_value: Optional[Union[int, float, str, datetime.datetime]] = True) → int

Count the number of rows in a data frame. If the optional predicate is given, only the rows that satisfy the predicate are counted.

Parameters
  • df (pandas.DataFrame) – Input data frame.

  • predicate (openclean.function.eval.base.EvalFunction, default=None) – Predicate that is evaluated over the rows in the data frame.

  • truth_value (scalar, default=True) – Count the occurrences of the truth value when evaluating the predicate on the data frame rows.

Return type

int
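The counting semantics described above can be illustrated with a minimal sketch in plain Python. It is not the openclean implementation: a list of dictionaries stands in for the data frame, an ordinary callable stands in for the EvalFunction, and the helper name count_rows is hypothetical.

```python
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Any]


def count_rows(
    rows: List[Row],
    predicate: Optional[Callable[[Row], Any]] = None,
    truth_value: Any = True,
) -> int:
    """Hypothetical stand-in that mirrors the documented behavior of count()."""
    if predicate is None:
        # Without a predicate every row is counted.
        return len(rows)
    # Count the rows for which the predicate evaluates to the truth value.
    return sum(1 for row in rows if predicate(row) == truth_value)


rows = [
    {"city": "Boston", "population": 650},
    {"city": "Salem", "population": 45},
    {"city": "Austin", "population": 960},
]

total = count_rows(rows)
large = count_rows(rows, predicate=lambda r: r["population"] > 500)
small = count_rows(rows, predicate=lambda r: r["population"] > 500, truth_value=False)
```

Setting truth_value to False counts the rows for which the predicate does not hold, which is the same as inverting the predicate.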

openclean.operator.collector.count.distinct(df: pandas.core.frame.DataFrame, columns: Optional[Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction, List[Union[int, str, histore.document.schema.Column, openclean.function.eval.base.EvalFunction]]]] = None, normalizer: Optional[Union[Callable, openclean.function.value.base.ValueFunction]] = None, keep_original: Optional[bool] = False, labels: Optional[Union[List[str], Tuple[str, str]]] = None) → collections.Counter

Compute the set of distinct value combinations for a single column, a given list of columns, or the list of values returned by a given evaluation function. Returns a Counter containing the distinct values (tuples in case of multiple input columns) together with their frequency counts.

If the optional normalization function is given, the frequency counts in the returned dictionary will be normalized. If the keep original flag is True, the returned dictionary will map key values to nested dictionaries that contain the original and the normalized value.

Parameters
  • df (pandas.DataFrame) – Input data frame.

  • columns (int, string, list, or openclean.function.eval.base.EvalFunction, default=None) – Evaluation function to extract values from data frame rows. This can also be a single column reference or a list of column references. If not given, the distinct rows are counted.

  • normalizer (callable or openclean.function.value.base.ValueFunction, default=None) – Optional normalization function that will be used to normalize the frequency counts in the returned dictionary.

  • keep_original (bool, default=False) – If the keep original value is set to True, the resulting dictionary will map key values to dictionaries. Each nested dictionary will have two elements, the original (‘absolute’) value and the normalized value.

  • labels (list or tuple, default=('absolute', 'normalized')) – List or tuple with exactly two elements. The labels will only be used if the keep_original flag is True. The first element is the label for the original value in the returned nested dictionary and the second element is the label for the normalized value.

Return type

dict
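The shape of the result can be sketched with collections.Counter from the standard library. This is an illustration only, not the openclean implementation: a list of dictionaries stands in for the data frame, the helper names distinct_values and with_normalized are hypothetical, and dividing each count by the total is just one assumed choice of normalization.

```python
from collections import Counter
from typing import Any, Dict, List, Optional, Sequence, Tuple, Union

Row = Dict[str, Any]


def distinct_values(
    rows: List[Row],
    columns: Optional[Union[str, Sequence[str]]] = None,
) -> Counter:
    """Hypothetical stand-in: count distinct values or value combinations."""
    if columns is None:
        # No columns given: each full row is one value combination.
        keys = (tuple(row.values()) for row in rows)
    elif isinstance(columns, str):
        # Single column: keys are scalar values.
        keys = (row[columns] for row in rows)
    else:
        # Multiple columns: keys are tuples of values.
        keys = (tuple(row[c] for c in columns) for row in rows)
    return Counter(keys)


def with_normalized(
    counts: Counter,
    labels: Tuple[str, str] = ("absolute", "normalized"),
) -> Dict[Any, Dict[str, float]]:
    """Keep the original count next to an assumed normalization (share of total)."""
    absolute, normalized = labels
    total = sum(counts.values())
    return {
        key: {absolute: n, normalized: n / total}
        for key, n in counts.items()
    }


rows = [
    {"city": "Boston", "state": "MA"},
    {"city": "Boston", "state": "MA"},
    {"city": "Salem", "state": "OR"},
    {"city": "Salem", "state": "MA"},
]

by_city = distinct_values(rows, "city")
by_city_state = distinct_values(rows, ["city", "state"])
by_state_nested = with_normalized(distinct_values(rows, "state"))
```

Here by_city uses scalar keys, by_city_state uses tuple keys, and by_state_nested shows the nested dictionaries described for keep_original=True, keyed by the two labels.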