openclean.embedding.base module

Operator that creates feature vectors (embeddings) for a list of values. Embeddings are generated by a vector generator that is applied to each element in a given stream of scalar values or tuples of scalar values.

class openclean.embedding.base.Embedding(features)

Bases: object

Compute feature vectors for values in a given stream of scalar values or tuples of scalar values.

exec(values)

Return an array that contains a feature vector for each distinct value in the given input data. Each vector is computed using a list of value feature functions. The resulting array has one column per feature function and one row per distinct value.

Parameters

values (dict) – Dictionary that maps each distinct scalar value (or tuple of scalar values) to its frequency count.

Return type

openclean.embedding.base.FeatureVector
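
The flow described for exec can be illustrated with a minimal, self-contained sketch. It does not import openclean and is not the library's implementation: exec_sketch and LengthAndDigits are hypothetical stand-ins, and the assumption that the embedder is prepared on the distinct values before each value is embedded is inferred from the prepare and embed descriptions on this page.

```python
from collections import Counter, OrderedDict

import numpy as np


class LengthAndDigits:
    """Hypothetical embedder that produces two features per value."""

    def prepare(self, values):
        # Remember the longest string length so lengths can be scaled to [0, 1].
        self.max_len = max((len(str(v)) for v in values), default=1) or 1
        return self

    def embed(self, value):
        text = str(value)
        digits = sum(ch.isdigit() for ch in text)
        # Feature 1: relative length. Feature 2: fraction of digit characters.
        return np.array([len(text) / self.max_len, digits / max(len(text), 1)])


def exec_sketch(values, features):
    """Stand-in for Embedding.exec: one feature vector per distinct value."""
    # Assumption: the embedder sees all distinct values before embedding.
    embedder = features.prepare(values.keys())
    result = OrderedDict()
    for value in values:
        result[value] = embedder.embed(value)
    return result


# Distinct values mapped to their frequency counts, as the parameter requires.
counts = Counter(["NY", "NY", "10003", "Brooklyn"])
vectors = exec_sketch(counts, LengthAndDigits())

# One row per distinct value, one column per feature.
matrix = np.array(list(vectors.values()))
```

The frequency counts are not used by this sketch; they are only carried along because the documented parameter is a dictionary of counts.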

class openclean.embedding.base.FeatureVector

Bases: collections.OrderedDict

Maintain a list of values (e.g., the distinct values in a data frame column) together with their feature vectors (e.g., word embeddings).

add(value, vec)

property data

Get a numpy array containing all feature vectors. The order of vectors is the same as the order in which they were added.

Return type

numpy.array
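
The relationship between add and data can be sketched with a small stand-in class. FeatureVectorSketch is a hypothetical name, and the body of add is an assumption (this page gives no description for it); only the ordered-dictionary base and the insertion-order guarantee of data come from the documentation above.

```python
from collections import OrderedDict

import numpy as np


class FeatureVectorSketch(OrderedDict):
    """Stand-in that maps each value to its feature vector in insertion order."""

    def add(self, value, vec):
        # Assumption: add stores the vector under the value as dictionary key.
        self[value] = vec

    @property
    def data(self):
        # Stack the vectors row by row, in the order they were added.
        return np.array(list(self.values()))


fv = FeatureVectorSketch()
fv.add("NY", np.array([0.25, 0.0]))
fv.add("10003", np.array([0.625, 1.0]))

rows = fv.data
```

Here rows has one row per added value, and row i belongs to the i-th key of fv.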

class openclean.embedding.base.ValueEmbedder

Bases: object

Abstract generator class of value (word) embeddings for scalar values. Outputs a feature vector for each value.

abstract embed(value)

Return the embedding vector for a given scalar value.

Parameters

value (scalar) – Scalar value (or tuple) in a data stream.

Return type

numpy.array

abstract prepare(values)

Pass the list of values to the vector generator to pre-compute any statistics (e.g., min-max values) that are required.

Parameters

values (iterable) – List of data values.

Return type

openclean.embedding.base.ValueEmbedder
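
A subclass has to supply both methods. The following sketch mirrors the documented interface without importing openclean; ValueEmbedderSketch and MinMaxLength are hypothetical names, and returning self from prepare is an assumption based on the documented return type.

```python
from abc import ABCMeta, abstractmethod

import numpy as np


class ValueEmbedderSketch(metaclass=ABCMeta):
    """Stand-in that mirrors the two abstract methods documented above."""

    @abstractmethod
    def embed(self, value):
        """Return the embedding vector for a scalar value."""

    @abstractmethod
    def prepare(self, values):
        """Pre-compute statistics and return the embedder to use."""


class MinMaxLength(ValueEmbedderSketch):
    """Hypothetical embedder: string length scaled to [0, 1] via min-max."""

    def __init__(self):
        self.min_len = 0
        self.max_len = 0

    def prepare(self, values):
        lengths = [len(str(v)) for v in values]
        if lengths:
            # Min-max statistics that embed needs later.
            self.min_len, self.max_len = min(lengths), max(lengths)
        return self

    def embed(self, value):
        span = self.max_len - self.min_len
        if span == 0:
            # All values have the same length; avoid division by zero.
            return np.array([0.0])
        return np.array([(len(str(value)) - self.min_len) / span])


embedder = MinMaxLength().prepare(["a", "abc", "abcde"])
vec = embedder.embed("abc")
```

Splitting the work into prepare and embed lets statistics that depend on the whole value set be computed once, before any single value is embedded.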

openclean.embedding.base.embedding(df, columns, features)

Compute feature vectors (embeddings) for values in one or more data frame columns. Computes a feature vector for each value using the given vector generator.

Each feature vector is n-dimensional, where n is the number of features. The resulting array has one row per value in the selected data frame column(s).

Parameters
  • df (pandas.DataFrame) – Input data frame

  • columns (int or string or list(int or string)) – Column index or column name (or a list of these) identifying the column(s) for which distinct value combinations are computed.

  • features (openclean.embedding.base.ValueEmbedder) – Generator for feature vectors that computes a vector of numeric values for a given scalar value (or tuple).

Return type

openclean.embedding.base.FeatureVector