openclean.profiling.dataset module

Helper class that provides added functionality on top of a list of column profiling results.

class openclean.profiling.dataset.DatasetProfile

Bases: list

The dataset profiler provides functionality to access and transform the list of profiling results for columns in a dataset. Expects a list of dictionaries, each dictionary containing at least the following information about a column:

  • minimum value

  • maximum value

  • total number of values

  • total number of non-empty values

  • datatypes

Additional information includes the distinct number of values with their respective frequency counts.
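
As a rough illustration of the expected input, the sketch below builds a list of per-column result dictionaries in plain Python. The helper make_column_stats and all key names (column, stats, minimum, maximum, total, non_empty, datatypes) are hypothetical and chosen for readability; the keys actually produced by openclean profilers may differ.

```python
from typing import Dict, List


def make_column_stats(values: List) -> Dict:
    """Compute the minimal statistics listed above for one column.

    Hypothetical helper: key names are assumptions, not openclean's own.
    """
    # Treat None and the empty string as empty values.
    non_empty = [v for v in values if v is not None and v != ""]
    return {
        "minimum": min(non_empty) if non_empty else None,
        "maximum": max(non_empty) if non_empty else None,
        "total": len(values),
        "non_empty": len(non_empty),
        "datatypes": sorted({type(v).__name__ for v in non_empty}),
    }


# One dictionary per column, in schema order.
profiles = [
    {"column": "age", "stats": make_column_stats([31, 45, None, 27])},
    {"column": "city", "stats": make_column_stats(["Oslo", "", "Bergen"])},
]
```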

add(name: str, stats: Dict)

Add profiler results for a given column to the list.

Parameters
  • name (string) – Column name

  • stats (dict) – Profiling results for the column.

column(name: Union[int, str]) Dict

Get the profiling results for a given column.

Parameters

name (int or string) – Name or index position of the referenced column.

Return type

dict

minmax(column: Union[int, str]) pandas.core.frame.DataFrame

Get data frame with (min, max)-values for all data types in a given column.

Raises a ValueError if the specified column is unknown.

Parameters

column (int or string) – Column index or column name.

Return type

pd.DataFrame

multitype_columns() openclean.profiling.dataset.DatasetProfile

Get a dataset profiler that only contains information for those columns that have values of more than one raw data type.

Return type

openclean.profiling.dataset.DatasetProfile
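
The filtering idea can be sketched in plain Python as follows. This is an illustration only, not the library's implementation; it assumes a hypothetical layout in which each profiling result carries a datatypes dictionary that maps a type label to a value count.

```python
from typing import Dict, List, Tuple


def multitype(profiles: List[Tuple[str, Dict]]) -> List[Tuple[str, Dict]]:
    """Keep only columns whose values span more than one raw data type."""
    return [
        (name, stats)
        for name, stats in profiles
        if len(stats["datatypes"]) > 1
    ]


# Hypothetical profiling results: 'zip' mixes integers and strings.
profiles = [
    ("zip", {"datatypes": {"int": 98, "str": 2}}),
    ("name", {"datatypes": {"str": 100}}),
]
mixed = multitype(profiles)
```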

profiles() List[Tuple[str, Dict]]

Get a list of (column name, profiling result) tuples for all columns in the dataset.

Return type

list

stats() pandas.core.frame.DataFrame

Get a data frame containing the basic statistics for each column. This includes the column name, the minimum and maximum value, the number of total values, empty values, and (if present) the number of distinct values per column.

Return type

pd.DataFrame

types(distinct: Optional[bool] = False) pandas.core.frame.DataFrame

Get a data frame containing type information for all columns that are included in the profiling results. For each column, the total number of values for each datatype that occurs in the dataset is included.

If datatype information is divided into total and distinct counts, the user has the option to get the count of distinct values for each type instead of the total counts by setting the distinct flag to True.

Parameters

distinct (bool, default=False) – Return type counts for distinct values instead of total counts.

Return type

pd.DataFrame
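
The effect of the distinct flag can be sketched without pandas. The function type_counts and the nested total/distinct layout are assumptions made for this example; the sketch returns a plain dictionary rather than a data frame.

```python
from typing import Dict, List, Tuple


def type_counts(profiles: List[Tuple[str, Dict]], distinct: bool = False) -> Dict:
    """Return per-column type counts, total or distinct depending on the flag."""
    key = "distinct" if distinct else "total"
    return {
        name: {dtype: counts[key] for dtype, counts in stats["datatypes"].items()}
        for name, stats in profiles
    }


# Hypothetical result where each type has both a total and a distinct count.
profiles = [
    ("zip", {"datatypes": {
        "int": {"total": 98, "distinct": 40},
        "str": {"total": 2, "distinct": 1},
    }}),
]
totals = type_counts(profiles)
distincts = type_counts(profiles, distinct=True)
```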

unique_columns() openclean.profiling.dataset.DatasetProfile

Get a dataset profiler that only contains information for those columns that have a uniqueness of 1, i.e., where all values are unique.

Return type

openclean.profiling.dataset.DatasetProfile
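
A minimal sketch of the uniqueness criterion, assuming uniqueness is defined as the number of distinct values divided by the total number of values. That definition is an assumption for this example; the text above only states that a uniqueness of 1 means all values are unique.

```python
from typing import List


def uniqueness(values: List) -> float:
    """Ratio of distinct values to total values (assumed definition)."""
    if not values:
        # An empty column is reported as 0.0 here to avoid division by zero.
        return 0.0
    return len(set(values)) / len(values)


all_unique = uniqueness([1, 2, 3])
half_unique = uniqueness(["a", "a", "b", "b"])
```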

class openclean.profiling.dataset.ProfileConsumer(profilers: List[Tuple[int, str, openclean.profiling.base.DataProfiler]])

Bases: openclean.operator.stream.consumer.StreamConsumer

close() List[Dict]

Return a list containing the results from each of the profilers.

Return type

list

consume(rowid: int, row: List) List

Dispatch extracted column values to each consumer.

Parameters
  • rowid (int) – Unique row identifier

  • row (list) – List of values in the row.
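
The dispatch step might look roughly like the sketch below. CountingProfiler and DispatchingConsumer are hypothetical stand-ins written for this example; they mirror the (column index, column name, profiler) tuples of the constructor signature but are not the openclean classes.

```python
from typing import Dict, List, Tuple


class CountingProfiler:
    """Hypothetical column profiler that counts total and non-empty values."""

    def __init__(self):
        self.total = 0
        self.non_empty = 0

    def consume(self, value) -> None:
        self.total += 1
        if value is not None and value != "":
            self.non_empty += 1

    def close(self) -> Dict:
        return {"total": self.total, "non_empty": self.non_empty}


class DispatchingConsumer:
    """Hypothetical consumer that forwards each cell to its column profiler."""

    def __init__(self, profilers: List[Tuple[int, str, CountingProfiler]]):
        self.profilers = profilers

    def consume(self, rowid: int, row: List) -> List:
        # Extract the value at each profiled column index and pass it on.
        for colidx, _, profiler in self.profilers:
            profiler.consume(row[colidx])
        return row

    def close(self) -> List[Dict]:
        # Collect one result dictionary per profiled column.
        return [
            {"column": name, "stats": profiler.close()}
            for _, name, profiler in self.profilers
        ]


consumer = DispatchingConsumer([
    (0, "name", CountingProfiler()),
    (2, "city", CountingProfiler()),
])
for rowid, row in enumerate([["Ann", 31, "Oslo"], ["Bob", 45, ""]]):
    consumer.consume(rowid, row)
results = consumer.close()
```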

class openclean.profiling.dataset.ProfileOperator(profilers: Optional[Union[int, str, Tuple[Union[int, str], openclean.profiling.base.DataProfiler]]] = None, default_profiler: Optional[Type] = None)

Bases: openclean.operator.stream.processor.StreamProcessor

open(schema: List[Union[str, histore.document.schema.Column]]) openclean.operator.stream.consumer.StreamConsumer

Factory pattern for stream profiling consumers. Creates an instance of a stream profiler for each column that was selected for profiling. If no profilers were specified at object instantiation all columns will be profiled.

Parameters

schema (list of string) – List of column names in the data stream schema.

Return type

openclean.profiling.dataset.ProfileConsumer

class openclean.profiling.dataset.Profiler

Bases: object

Interface for data profilers that generate metadata for a given data frame.

abstract profile(df: pandas.core.frame.DataFrame, columns: Optional[Union[int, str, List[Union[str, int]]]] = None) Dict

Run profiler on a given data frame. The structure of the resulting dictionary is implementation dependent.

TODO: define required components in the result of a data profiler.

Parameters
  • df (pd.DataFrame) – Input data frame.

  • columns (int, string, or list(int or string), default=None) – Single column or list of column index positions or column names for those columns that are being profiled. Profile the full dataset if None.

Return type

dict

openclean.profiling.dataset.dataset_profile(df: pandas.core.frame.DataFrame, profilers: Optional[Union[int, str, Tuple[Union[int, str], openclean.profiling.base.DataProfiler]]] = None, default_profiler: Optional[Type] = None) openclean.profiling.dataset.DatasetProfile

Profiling operator for profiling one or more columns in a data frame. By default, all columns in the data frame are profiled independently using the default column profiler. The optional list of profilers overrides the default behavior by providing a list of column references and (optional) profiler functions.

Parameters
  • profilers (list of tuples of column reference and openclean.profiling.base.ProfilingFunction, default=None) – Specify the list of columns that are profiled and the profiling function. If only a column reference is given (not a tuple), the default profiler is used for profiling the column.

  • default_profiler (class, default=None) – Class object that is instantiated as the profiler for columns that do not have a profiler instance specified for them.
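
How the profilers and default_profiler arguments interact can be sketched in plain Python. The function normalize and the classes DefaultProfiler and CustomProfiler are hypothetical names introduced for this example; the sketch only illustrates the rule described above (bare column references get an instance of the default class, tuples keep their own profiler, None selects all columns) and is not the library's code.

```python
from typing import List, Optional, Tuple, Type


class DefaultProfiler:
    """Hypothetical default profiler class."""


class CustomProfiler:
    """Hypothetical user-supplied profiler class."""


def normalize(
    profilers,
    schema: List[str],
    default_profiler: Optional[Type] = None,
) -> List[Tuple[int, str, object]]:
    """Resolve column references into (index, name, profiler) tuples."""
    default_cls = default_profiler if default_profiler is not None else DefaultProfiler
    if profilers is None:
        # No selection given: profile every column in the schema.
        profilers = list(range(len(schema)))
    elif not isinstance(profilers, list):
        # A single reference or tuple is treated as a one-element list.
        profilers = [profilers]
    result = []
    for entry in profilers:
        if isinstance(entry, tuple):
            ref, profiler = entry
        else:
            # Bare column reference: instantiate the default profiler class.
            ref, profiler = entry, default_cls()
        idx = ref if isinstance(ref, int) else schema.index(ref)
        result.append((idx, schema[idx], profiler))
    return result


schema = ["name", "age", "city"]
custom = CustomProfiler()
selected = normalize(["name", (2, custom)], schema)
everything = normalize(None, schema)
```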