openclean.profiling.pattern.token_signature module

Token signatures for column type classification and anomaly detection.

Token signatures are sets of tokens that are representative of the values in a semantic type. Street addresses are a common example. In the U.S., values in dataset columns that contain street addresses are likely to contain tokens like ‘AVENUE’, ‘ROAD’, or ‘STREET’. Many of these tokens also have alternative abbreviations, e.g., ‘STREET’, ‘STRT’, and ‘STR’.

A token signature is a list of sets, where each set contains the different possible abbreviations for a token that is part of a representative signature for a semantic type.

When a signature is used for column type classification, for example, values in a column of the classified type are expected to contain exactly one representation of one of the tokens in the type signature.
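The following is a minimal sketch of how a signature could be matched against column values. It is not part of openclean: the names `ADDRESS_SIGNATURE`, `matching_tokens`, and `signature_match_ratio` are hypothetical, and splitting values on whitespace after upper-casing is an assumed tokenization.

```python
from typing import List, Set

# Illustrative signature built from the tokens mentioned above. Each set
# holds the alternative representations of one token.
ADDRESS_SIGNATURE: List[Set[str]] = [
    {"STREET", "STRT", "STR"},
    {"AVENUE"},
    {"ROAD"},
]


def matching_tokens(value: str, signature: List[Set[str]]) -> Set[str]:
    """Return the signature tokens that occur in the given value.

    Assumption: values are tokenized by upper-casing and splitting on
    whitespace.
    """
    tokens = set(value.upper().split())
    return {t for entry in signature for t in tokens & entry}


def signature_match_ratio(values: List[str], signature: List[Set[str]]) -> float:
    """Fraction of values that contain exactly one signature token."""
    if not values:
        return 0.0
    hits = sum(1 for v in values if len(matching_tokens(v, signature)) == 1)
    return hits / len(values)


column = ["123 Main Street", "5th Avenue", "Brooklyn"]
ratio = signature_match_ratio(column, ADDRESS_SIGNATURE)
```

A classifier could compare such a ratio against a threshold to decide whether a column is likely to contain street addresses. Values that match no token, or more than one, are candidates for anomaly detection.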

openclean.profiling.pattern.token_signature.token_signature(grouping: openclean.data.groupby.DataFrameGrouping, columns: Union[int, str, List[Union[str, int]]], include_key: Optional[bool] = True) → List[Set[str]]

Create a token signature from the specified columns in a data frame grouping.

Each group represents one entry in the returned signature. The set of distinct values from the specified columns, taken over all rows in the group, forms the signature entry with the different token representations.

Parameters

grouping (openclean.data.groupby.DataFrameGrouping) – Grouping of data frame rows.
columns (int, string, or list(int or string)) – Single column or list of columns, each specified by index position or column name.
include_key (bool, default=True) – Include the key value for each group in the signature entry that is being created for the group.

Return type

openclean.profiling.pattern.token_signature.TokenSignature
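The behavior described above can be sketched in plain Python. This is an illustration under stated assumptions, not the openclean implementation: the function `signature_from_groups` is hypothetical, and the grouping is modeled as a dictionary that maps each group key to a list of row dictionaries instead of a `DataFrameGrouping`.

```python
from typing import Dict, List, Sequence, Set


def signature_from_groups(
    groups: Dict[str, List[Dict[str, str]]],
    columns: Sequence[str],
    include_key: bool = True,
) -> List[Set[str]]:
    """Build one signature entry per group (hypothetical helper).

    Each entry is the set of distinct values found in the given columns
    over all rows of the group. If include_key is True, the group key is
    added to the entry as well.
    """
    signature = []
    for key, rows in groups.items():
        entry = {row[col] for row in rows for col in columns}
        if include_key:
            entry.add(key)
        signature.append(entry)
    return signature


# Stand-in for a grouping of rows by their full token name.
groups = {
    "STREET": [{"abbrev": "STR"}, {"abbrev": "STRT"}, {"abbrev": "STR"}],
    "AVENUE": [{"abbrev": "AVENUE"}],
}

with_key = signature_from_groups(groups, columns=["abbrev"])
without_key = signature_from_groups(groups, columns=["abbrev"], include_key=False)
```

With `include_key` set, the entry for the first group contains the key ‘STREET’ in addition to the abbreviations found in its rows. Without it, the entry contains only the column values.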