openclean.data.sequence module

Many operators in openclean operate on a sequence of scalar values or tuples of scalar values. In Python, sequences are represented by iterables (e.g., a list). This module contains a factory pattern for creating iterators over a single column or a set of columns in a pandas data frame.

class openclean.data.sequence.Sequence(df, columns)

Bases: object

Factory pattern for lists of values from a single data frame column or tuples from a list of columns.

The main reason for having a separate sequence class for pandas data frames is to have a wrapper that supports references to columns by name or index and that supports iteration over tuples from multiple columns. The sequence class can also handle data frames with duplicate column names.
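The wrapper idea described above can be sketched with plain pandas. This is an illustrative stand-in, not the openclean implementation: the names `resolve_column` and `ColumnSequence` are hypothetical, and the choice to resolve a duplicated name to its first occurrence is an assumption made for the example.

```python
import pandas as pd


def resolve_column(df, col):
    """Hypothetical helper: map a column name or position to a position."""
    if isinstance(col, int):
        if not 0 <= col < len(df.columns):
            raise ValueError(f"invalid column index {col}")
        return col
    # Assumption for this sketch: a duplicated name resolves to its
    # first occurrence in the data frame schema.
    for idx, name in enumerate(df.columns):
        if name == col:
            return idx
    raise ValueError(f"unknown column {col!r}")


class ColumnSequence:
    """Illustrative stand-in, not the openclean Sequence class."""

    def __init__(self, df, columns):
        self.df = df
        # A single name or index yields scalars; a list yields tuples.
        self.single = not isinstance(columns, (list, tuple))
        cols = [columns] if self.single else list(columns)
        self.colidx = [resolve_column(df, c) for c in cols]

    def __iter__(self):
        if self.single:
            return iter(self.df.iloc[:, self.colidx[0]].tolist())
        subset = self.df.iloc[:, self.colidx]
        return iter(subset.itertuples(index=False, name=None))


# Data frame with a duplicated column name 'A'.
df = pd.DataFrame([[1, "a", 10], [2, "b", 20]], columns=["A", "B", "A"])
names = list(ColumnSequence(df, "B"))
pairs = list(ColumnSequence(df, [0, 2]))
```

Because every column is resolved to a position before any data is read, both columns named `A` stay addressable by index even though the name alone is ambiguous.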

openclean.data.sequence.multi_column_iterator(df, colidx)

Iterator over values in multiple columns in a data frame.

Parameters
  • df (pandas.DataFrame) – Pandas data frame.

  • colidx – List of indexes for columns in the data frame.

Return type

tuple
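A minimal sketch of such an iterator, assuming that `colidx` holds integer column positions. The function name `iter_column_tuples` is hypothetical and the body is not taken from openclean; it only shows one way to produce one tuple per row with pandas.

```python
import pandas as pd


def iter_column_tuples(df, colidx):
    """Yield one tuple per row with the values at the given positions."""
    # Select each column by position so that duplicate names do not matter.
    columns = [df.iloc[:, i] for i in colidx]
    # zip() walks the selected columns in parallel, row by row.
    yield from zip(*columns)


df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [23, 45], "city": ["NY", "LA"]})
rows = list(iter_column_tuples(df, [0, 2]))
```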

openclean.data.sequence.single_column_iterator(df, colidx)

Iterator over values in a single data frame column.

Parameters
  • df (pandas.DataFrame) – Pandas data frame.

  • colidx – Index of the data frame column.

Return type

scalar
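The sketch below illustrates why addressing the column by position is useful, again under the assumption that `colidx` is an integer position. The name `iter_column_values` is hypothetical and does not reflect the openclean source. In pandas, selecting a duplicated name with `df["A"]` returns a data frame rather than a single column, whereas `df.iloc[:, i]` always returns one column.

```python
import pandas as pd


def iter_column_values(df, colidx):
    """Yield the values of the column at position colidx, one per row."""
    for value in df.iloc[:, colidx]:
        yield value


# Both the first and the third column are named 'A'.
df = pd.DataFrame([[1, "a", 10], [2, "b", 20]], columns=["A", "B", "A"])

by_name = df["A"]  # ambiguous name: pandas returns a DataFrame with two columns
values = list(iter_column_values(df, 2))
```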