openclean.profiling.dataset module
Helper class that provides added functionality on top of a list of column profiling results.
- class openclean.profiling.dataset.DatasetProfile
Bases:
list
The dataset profiler provides functionality to access and transform the list of profiling results for columns in a dataset. Expects a list of dictionaries, each dictionary containing at least the following information about each column:
minimum value
maximum value
total number of values
total number of non-empty values
datatypes
Additional information includes the distinct number of values with their respective frequency counts.
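As a rough illustration of the input described above, the sketch below builds such a list of per-column dictionaries in plain Python. The helper `profile_column` and the key names (`column`, `min`, `max`, `total`, `non_empty`, `datatypes`, `distinct`) are hypothetical and chosen for readability; the keys that openclean actually uses may differ.

```python
from collections import Counter
from typing import Any, Dict, List


def profile_column(name: str, values: List[Any]) -> Dict[str, Any]:
    """Compute a minimal profiling result for one column (hypothetical keys)."""
    # Treat None and the empty string as empty values.
    non_empty = [v for v in values if v is not None and v != ""]
    return {
        "column": name,
        "min": min(non_empty) if non_empty else None,
        "max": max(non_empty) if non_empty else None,
        "total": len(values),
        "non_empty": len(non_empty),
        # Count values per Python type name as a stand-in for data types.
        "datatypes": dict(Counter(type(v).__name__ for v in non_empty)),
        # Optional extra: distinct values with their frequency counts.
        "distinct": dict(Counter(non_empty)),
    }


results = [
    profile_column("age", [31, 45, 31, None]),
    profile_column("city", ["Oslo", "Bergen", "", "Oslo"]),
]
```

Each entry in `results` carries the five required pieces of information plus the optional distinct-value counts.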
- add(name: str, stats: Dict)
Add profiler results for a given column to the list.
- Parameters
name (string) – Column name
stats (dict) – Profiling results for the column.
- column(name: Union[int, str]) Dict
Get the profiling results for a given column.
- Parameters
name (int or string) – Name or index position of the referenced column.
- Return type
dict
- minmax(column: Union[int, str]) pandas.core.frame.DataFrame
Get data frame with (min, max)-values for all data types in a given column.
Raises a ValueError if the specified column is unknown.
- Parameters
column (int or string) – Column index or column name.
- Return type
pd.DataFrame
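To make the idea of per-type (min, max) values concrete, here is a minimal sketch in plain Python. It is not the library's implementation: `minmax_by_type` is a hypothetical helper, Python type names stand in for the profiler's data types, and a dictionary is returned where the documented method returns a data frame.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def minmax_by_type(values: List[Any]) -> Dict[str, Tuple[Any, Any]]:
    """Group values by type name and return (min, max) for each group."""
    groups: Dict[str, List[Any]] = defaultdict(list)
    for value in values:
        groups[type(value).__name__].append(value)
    # Values within one group share a type, so they are safe to compare.
    return {name: (min(vals), max(vals)) for name, vals in groups.items()}


ranges = minmax_by_type([3, "b", 10, "a", 7])
```

Grouping first avoids comparing values of different types, which would raise a `TypeError` in Python.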
- multitype_columns() openclean.profiling.dataset.DatasetProfile
Get a dataset profiler that only contains information for those columns that have values of more than one raw data type.
- Return type
openclean.profiling.dataset.DatasetProfile
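The filter can be pictured as keeping only columns whose type summary has more than one entry. The sketch below assumes a hypothetical `datatypes` key that maps type names to counts; the real profiling results may be structured differently.

```python
from typing import Any, Dict, List, Tuple

Profile = Tuple[str, Dict[str, Any]]

# Hypothetical (column name, profiling result) pairs.
profiles: List[Profile] = [
    ("id", {"datatypes": {"int": 4}}),
    ("zip", {"datatypes": {"int": 3, "str": 1}}),
]


def multitype(items: List[Profile]) -> List[Profile]:
    """Keep columns that contain values of more than one data type."""
    return [(name, stats) for name, stats in items if len(stats["datatypes"]) > 1]


mixed = multitype(profiles)
```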
- profiles() List[Tuple[str, Dict]]
Get a list of (column name, profiling result) tuples for all columns in the dataset.
- Return type
list
- stats() pandas.core.frame.DataFrame
Get a data frame containing the basic statistics for each column. This includes the column name, the minimum and maximum value, the number of total values, empty values, and (if present) the number of distinct values per column.
- Return type
pd.DataFrame
- types(distinct: Optional[bool] = False) pandas.core.frame.DataFrame
Get a data frame containing type information for all columns that are included in the profiling results. For each column, the number of total values for each datatype that occurs in the dataset is included.
If datatype information is divided into total and distinct counts, the user has the option to get the count of distinct values for each type instead of the total counts by setting the distinct flag to True.
- Parameters
distinct (bool, default=False) – Return type counts for distinct values instead of total counts.
- Return type
pd.DataFrame
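The difference between total and distinct type counts can be shown with a small sketch. `type_counts` is a hypothetical helper that uses Python type names in place of the profiler's data types; it illustrates the effect of the flag and is not how openclean computes its results.

```python
from collections import Counter
from typing import Any, Dict, List


def type_counts(values: List[Any], distinct: bool = False) -> Dict[str, int]:
    """Count values per type, optionally counting each distinct value once."""
    # Deduplicate first when only distinct values should be counted.
    items = set(values) if distinct else values
    return dict(Counter(type(v).__name__ for v in items))


values = [1, 1, 2, "a", "a"]
total = type_counts(values)
unique = type_counts(values, distinct=True)
```

Here `total` counts every occurrence, while `unique` counts each distinct value only once per type.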
- unique_columns() openclean.profiling.dataset.DatasetProfile
Get a dataset profiler that only contains information for those columns that have a uniqueness of 1, i.e., where all values are unique.
- Return type
openclean.profiling.dataset.DatasetProfile
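A uniqueness of 1 can be read as "the number of distinct values equals the number of total values". The sketch below assumes that uniqueness is the ratio of distinct to total values and uses hypothetical `total` and `distinct` keys; both are assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional, Tuple


def uniqueness(stats: Dict[str, Any]) -> Optional[float]:
    """Ratio of distinct to total values; None for an empty column."""
    total = stats["total"]
    return stats["distinct"] / total if total else None


# Hypothetical (column name, profiling result) pairs.
profiles: List[Tuple[str, Dict[str, Any]]] = [
    ("id", {"total": 4, "distinct": 4}),
    ("city", {"total": 4, "distinct": 2}),
]

unique = [(name, stats) for name, stats in profiles if uniqueness(stats) == 1]
```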
- class openclean.profiling.dataset.ProfileConsumer(profilers: List[Tuple[int, str, openclean.profiling.base.DataProfiler]])
Bases:
openclean.operator.stream.consumer.StreamConsumer
- close() List[Dict]
Return a list containing the results from each of the profilers.
- Return type
list
- consume(rowid: int, row: List) List
Dispatch extracted column values to each consumer.
- Parameters
rowid (int) – Unique row identifier
row (list) – List of values in the row.
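The dispatch step can be sketched as follows. The classes `CountingProfiler` and `MiniProfileConsumer` are hypothetical stand-ins written for this example, and the shape of the result dictionaries is an assumption; only the (column index, column name, profiler) triples and the `consume`/`close` method names mirror the signatures documented above.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


class CountingProfiler:
    """Hypothetical column profiler that counts value frequencies."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def consume(self, value: Any) -> None:
        self.counts[value] += 1

    def close(self) -> Dict[str, int]:
        return {"total": sum(self.counts.values()), "distinct": len(self.counts)}


class MiniProfileConsumer:
    """Hypothetical consumer that forwards column values to profilers."""

    def __init__(self, profilers: List[Tuple[int, str, CountingProfiler]]) -> None:
        self.profilers = profilers

    def consume(self, rowid: int, row: List[Any]) -> List[Any]:
        # Pass the value at each selected column index to its profiler.
        for colidx, _, profiler in self.profilers:
            profiler.consume(row[colidx])
        return row

    def close(self) -> List[Dict[str, Any]]:
        return [{"column": name, "stats": p.close()} for _, name, p in self.profilers]


consumer = MiniProfileConsumer(
    [(0, "id", CountingProfiler()), (2, "city", CountingProfiler())]
)
rows = [[1, "x", "Oslo"], [2, "y", "Oslo"], [3, "z", "Bergen"]]
for rowid, row in enumerate(rows):
    consumer.consume(rowid, row)
summary = consumer.close()
```

Because each profiler only sees the values of its own column, the rows can be processed one at a time without holding the full dataset in memory.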
- class openclean.profiling.dataset.ProfileOperator(profilers: Optional[Union[int, str, Tuple[Union[int, str], openclean.profiling.base.DataProfiler]]] = None, default_profiler: Optional[Type] = None)
Bases:
openclean.operator.stream.processor.StreamProcessor
- open(schema: List[Union[str, histore.document.schema.Column]]) openclean.operator.stream.consumer.StreamConsumer
Factory pattern for stream profiling consumers. Creates an instance of a stream profiler for each column that was selected for profiling. If no profilers were specified at object instantiation all columns will be profiled.
- Parameters
schema (list of string) – List of column names in the data stream schema.
- Return type
openclean.operator.stream.consumer.StreamConsumer
- class openclean.profiling.dataset.Profiler
Bases:
object
Interface for data profilers that generate metadata for a given data frame.
- abstract profile(df: pandas.core.frame.DataFrame, columns: Optional[Union[int, str, List[Union[str, int]]]] = None) Dict
Run profiler on a given data frame. The structure of the resulting dictionary is implementation dependent.
TODO: define required components in the result of a data profiler.
- Parameters
df (pd.DataFrame) – Input data frame.
columns (int, string, or list(int or string), default=None) – Single column or list of column index positions or column names for those columns that are being profiled. Profile the full dataset if None.
- Return type
dict
- openclean.profiling.dataset.dataset_profile(df: pandas.core.frame.DataFrame, profilers: Optional[Union[int, str, Tuple[Union[int, str], openclean.profiling.base.DataProfiler]]] = None, default_profiler: Optional[Type] = None) openclean.profiling.dataset.DatasetProfile
Profiling operator for profiling one or more columns in a data frame. By default, all columns in the data stream are profiled independently using the default column profiler. The optional list of profilers allows overriding the default behavior by providing a list of column references and (optional) profiler functions.
- Parameters
profilers (list of tuples of column reference and openclean.profiling.base.ProfilingFunction, default=None) – Specify the list of columns that are profiled and the profiling function. If only a column reference is given (not a tuple), the default profiler is used for profiling the column.
default_profiler (class, default=None) – Class object that is instantiated as the profiler for columns that do not have a profiler instance specified for them.
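One way to picture how the two parameters interact is a small resolver that turns the `profilers` argument into (column index, column name, profiler) triples. This is a sketch under stated assumptions, not openclean's code: `resolve_profilers` and `DefaultProfiler` are hypothetical names, and column references are resolved against a plain list of column names.

```python
from typing import Any, List, Optional, Tuple, Type, Union

ColumnRef = Union[int, str]
Spec = Union[ColumnRef, Tuple[ColumnRef, Any]]


class DefaultProfiler:
    """Hypothetical placeholder for the default column profiler class."""


def resolve_profilers(
    schema: List[str],
    profilers: Optional[Union[Spec, List[Spec]]] = None,
    default_profiler: Type = DefaultProfiler,
) -> List[Tuple[int, str, Any]]:
    """Map column references to (index, name, profiler) triples."""
    if profilers is None:
        # No selection given: profile every column in the schema.
        profilers = list(range(len(schema)))
    elif not isinstance(profilers, list):
        profilers = [profilers]
    resolved = []
    for spec in profilers:
        if isinstance(spec, tuple):
            ref, profiler = spec
        else:
            # Bare column reference: instantiate the default profiler class.
            ref, profiler = spec, default_profiler()
        idx = ref if isinstance(ref, int) else schema.index(ref)
        resolved.append((idx, schema[idx], profiler))
    return resolved


schema = ["a", "b", "c"]
selected = resolve_profilers(schema, ["b", (2, "custom-profiler")])
```

In this sketch, column `b` receives a fresh `DefaultProfiler` instance, while column `c` keeps the profiler object that was passed in with it.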