openclean.profiling.dataset module

Helper class that provides added functionality on top of a list of column profiling results.

class openclean.profiling.dataset.DatasetProfile

Bases: list

The dataset profiler provides functionality to access and transform the list of profiling results for columns in a dataset. Expects a list of dictionaries, each dictionary containing at least the following information about a column:

minimum value
maximum value
total number of values
total number of non-empty values
datatypes

Additional information includes the distinct number of values with their respective frequency counts.
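
To make the expected shape concrete, here is a minimal sketch that builds one such dictionary for a single column using only the standard library. The key names (min, max, total, non_empty, datatypes, distinct) and the helper profile_column are assumptions chosen for illustration; they are not confirmed to match the keys that openclean itself uses.

```python
from collections import Counter


def profile_column(values):
    """Build a profiling dictionary for one column (hypothetical key names)."""
    # Treat None and the empty string as empty values.
    non_empty = [v for v in values if v is not None and v != ""]
    return {
        "min": min(non_empty) if non_empty else None,
        "max": max(non_empty) if non_empty else None,
        "total": len(values),
        "non_empty": len(non_empty),
        # Number of values per raw Python type name.
        "datatypes": dict(Counter(type(v).__name__ for v in non_empty)),
        # Distinct values with their frequency counts.
        "distinct": dict(Counter(non_empty)),
    }


example = profile_column([3, 1, None, 3, 7])
```

A list of such dictionaries, one per column, is the kind of input the class description above refers to.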

add(name: str, stats: Dict)

Add profiler results for a given column to the list.

Parameters

name (string) – Column name
stats (dict) – Profiling results for the column.

column(name: Union[int, str]) → Dict

Get the profiling results for a given column.

Parameters: name (int or string) – Name or index position of the referenced column.
Return type: dict

minmax(column: Union[int, str]) → pandas.core.frame.DataFrame

Get data frame with (min, max)-values for all data types in a given column.

Raises a ValueError if the specified column is unknown.

Parameters: column (int or string) – Column index or column name.
Return type: pd.DataFrame

multitype_columns() → openclean.profiling.dataset.DatasetProfile

Get a dataset profiler that only contains information for those columns that have values of more than one raw data type.

Return type: openclean.profiling.dataset.DatasetProfile
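
The selection rule can be sketched on plain (column name, profiling result) tuples. This is an illustration only: the datatypes key and the standalone function below are assumptions, not the library's implementation.

```python
def multitype_columns(profiles):
    """Keep only columns whose values span more than one raw data type."""
    return [
        (name, stats)
        for name, stats in profiles
        if len(stats["datatypes"]) > 1
    ]


# Hypothetical profiling results for two columns.
profiles = [
    ("zip", {"datatypes": {"int": 98, "str": 2}}),
    ("city", {"datatypes": {"str": 100}}),
]
mixed = multitype_columns(profiles)
```

Here only the zip column is retained, because it holds both integer and string values.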

profiles() → List[Tuple[str, Dict]]

Get a list of (column name, profiling result) tuples for all columns in the dataset.

Return type: list

stats() → pandas.core.frame.DataFrame

Get a data frame containing the basic statistics for each column. This includes the column name, the minimum and maximum value, the number of total values, empty values, and (if present) the number of distinct values per column.

Return type: pd.DataFrame

types(distinct: Optional[bool] = False) → pandas.core.frame.DataFrame

Get a data frame containing type information for all columns that are included in the profiling results. For each column, the number of total values for each datatype that occurs in the dataset is included.

If datatype information is divided into total and distinct counts, the user has the option to get the count of distinct values for each type instead of the total counts by setting the distinct flag to True.

Parameters: distinct (bool, default=False) – Return type counts for distinct values instead of total counts.
Return type: pd.DataFrame
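
The effect of the distinct flag can be sketched as follows, assuming each datatype entry is stored as a dictionary with total and distinct counts. That layout, and the use of a nested dictionary in place of a data frame, are assumptions made to keep the example self-contained.

```python
def type_counts(profiles, distinct=False):
    """Return per-column counts for each datatype as a nested dictionary."""
    key = "distinct" if distinct else "total"
    return {
        name: {dtype: counts[key] for dtype, counts in stats["datatypes"].items()}
        for name, stats in profiles
    }


# Hypothetical profiling result with counts split into total and distinct.
profiles = [
    ("zip", {"datatypes": {
        "int": {"total": 98, "distinct": 40},
        "str": {"total": 2, "distinct": 1},
    }}),
]
totals = type_counts(profiles)
distincts = type_counts(profiles, distinct=True)
```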

unique_columns() → openclean.profiling.dataset.DatasetProfile

Get a dataset profiler that only contains information for those columns that have a uniqueness of 1, i.e., where all values are unique.

Return type: openclean.profiling.dataset.DatasetProfile
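
As a rough sketch of this filter, assume uniqueness is the ratio of distinct values to total values, and that each profiling result carries a total count plus a mapping of distinct values to frequencies. Both the key names and the ratio definition are assumptions for illustration.

```python
def uniqueness(stats):
    """Ratio of distinct values to total values (0.0 for an empty column)."""
    total = stats["total"]
    return len(stats["distinct"]) / total if total else 0.0


def unique_columns(profiles):
    """Keep only columns in which every value occurs exactly once."""
    return [(name, stats) for name, stats in profiles if uniqueness(stats) == 1]


# Hypothetical profiling results for two columns.
profiles = [
    ("id", {"total": 3, "distinct": {1: 1, 2: 1, 3: 1}}),
    ("city", {"total": 3, "distinct": {"NYC": 2, "LA": 1}}),
]
keys = unique_columns(profiles)
```

Under these assumptions only the id column qualifies, which makes the method a quick way to spot candidate key columns.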

class openclean.profiling.dataset.ProfileConsumer(profilers: List[Tuple[int, str, openclean.profiling.base.DataProfiler]])

Bases: openclean.operator.stream.consumer.StreamConsumer

close() → List[Dict]

Return a list containing the results from each of the profilers.

Return type: list

consume(rowid: int, row: List) → List

Dispatch extracted column values to each consumer.

Parameters

rowid (int) – Unique row identifier
row (list) – List of values in the row.
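
The dispatch step might look like the sketch below. It mirrors the constructor argument shown above (a list of column index, column name, profiler triples), but both classes are hypothetical stand-ins written for illustration and do not reproduce the library's code.

```python
class CountingProfiler:
    """Hypothetical per-column profiler that counts total and distinct values."""

    def __init__(self):
        self.values = []

    def consume(self, value):
        self.values.append(value)

    def close(self):
        return {"total": len(self.values), "distinct": len(set(self.values))}


class SketchProfileConsumer:
    """Hypothetical consumer that forwards column values to profilers."""

    def __init__(self, profilers):
        # profilers: list of (column index, column name, profiler) triples.
        self.profilers = profilers

    def consume(self, rowid, row):
        # Extract the value at each profiled column and hand it to its profiler.
        for colidx, _, profiler in self.profilers:
            profiler.consume(row[colidx])
        return row

    def close(self):
        # Collect one result dictionary per profiled column.
        return [
            {"column": name, "stats": profiler.close()}
            for _, name, profiler in self.profilers
        ]


consumer = SketchProfileConsumer([(0, "id", CountingProfiler()), (1, "city", CountingProfiler())])
for rowid, row in enumerate([[1, "NYC"], [2, "NYC"], [3, "LA"]]):
    consumer.consume(rowid, row)
results = consumer.close()
```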

class openclean.profiling.dataset.ProfileOperator(profilers: Optional[Union[int, str, Tuple[Union[int, str], openclean.profiling.base.DataProfiler]]] = None, default_profiler: Optional[Type] = None)

Bases: openclean.operator.stream.processor.StreamProcessor

open(schema: List[Union[str, histore.document.schema.Column]]) → openclean.operator.stream.consumer.StreamConsumer

Factory pattern for stream profiling consumers. Creates an instance of a stream profiler for each column that was selected for profiling. If no profilers were specified at object instantiation, all columns will be profiled.

Parameters: schema (list of string) – List of column names in the data stream schema.
Return type: openclean.profiling.dataset.ProfileConsumer

class openclean.profiling.dataset.Profiler

Bases: object

Interface for data profilers that generate metadata for a given data frame.

abstract profile(df: pandas.core.frame.DataFrame, columns: Optional[Union[int, str, List[Union[str, int]]]] = None) → Dict

Run profiler on a given data frame. The structure of the resulting dictionary is implementation dependent.

TODO: define required components in the result of a data profiler.

Parameters

df (pd.DataFrame) – Input data frame.
columns (int, string, or list(int or string), default=None) – Single column or list of column index positions or column names for those columns that are being profiled. Profile the full dataset if None.

Return type

dict

openclean.profiling.dataset.dataset_profile(df: pandas.core.frame.DataFrame, profilers: Optional[Union[int, str, Tuple[Union[int, str], openclean.profiling.base.DataProfiler]]] = None, default_profiler: Optional[Type] = None) → openclean.profiling.dataset.DatasetProfile

Profiling operator for profiling one or more columns in a data frame. By default, all columns in the data stream are profiled independently using the default column profiler. The optional list of profilers overrides the default behavior by providing a list of column references and (optional) profiler functions.

Parameters

profilers (list of tuples of column reference and openclean.profiling.base.ProfilingFunction, default=None) – Specify the list of columns that are profiled and the profiling function. If only a column reference is given (not a tuple), the default profiler is used for profiling the column.
default_profiler (class, default=None) – Class object that is instantiated as the profiler for columns that do not have a profiler instance specified for them.
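
How the two parameters interact can be sketched as a small resolution step: bare column references receive a fresh instance of the default profiler class, while tuples keep the profiler they were given. The function resolve_profilers and both profiler classes are hypothetical names introduced for this illustration; the library's internal logic may differ.

```python
class DefaultProfiler:
    """Hypothetical default profiler class."""


class CustomProfiler:
    """Hypothetical user-supplied profiler class."""


def resolve_profilers(columns, profilers=None, default_profiler=DefaultProfiler):
    """Pair every selected column with a profiler instance."""
    if profilers is None:
        # No selection given: profile all columns.
        profilers = list(columns)
    elif not isinstance(profilers, list):
        # A single column reference or tuple was given.
        profilers = [profilers]
    resolved = []
    for entry in profilers:
        if isinstance(entry, tuple):
            column, profiler = entry
        else:
            column, profiler = entry, default_profiler()
        resolved.append((column, profiler))
    return resolved


custom = CustomProfiler()
resolved = resolve_profilers(["id", "city"], profilers=["id", ("city", custom)])
everything = resolve_profilers(["id", "city"])
```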