openclean.embedding.base module
Operator that creates feature vectors (embeddings) for a list of values. Embeddings are generated by a vector generator that is applied to each element in a given stream of scalar values or tuples of scalar values.
- class openclean.embedding.base.Embedding(features)
Bases: object
Compute feature vectors for values in a given stream of scalar values or tuples of scalar values.
- exec(values)
Return an array that contains a feature vector for each distinct value in the given input data list. The vector is computed using a list of value feature functions. The resulting array has one column per feature function and one entry per distinct value.
- Parameters
values (dict) – Set of distinct scalar values or tuples of scalar values that are mapped to their respective frequency count.
- Return type
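The shape of the result described above can be illustrated with a small, dependency-free sketch. It is not the library's implementation: `exec_sketch` and the two feature functions are hypothetical names, a nested list stands in for the returned array, and the frequency counts in the input are ignored.

```python
from collections import Counter


def exec_sketch(values, feature_functions):
    """Build one row per distinct value and one column per feature function.

    Hypothetical stand-in for the behavior described for Embedding.exec;
    a nested list takes the place of the returned array.
    """
    # Iterating over the dict yields the distinct values (the keys);
    # the frequency counts are not used in this sketch.
    return [[f(v) for f in feature_functions] for v in values]


# Distinct values mapped to their frequency counts.
counts = Counter(["NY", "NY", "Boston", "LA"])

# Two illustrative feature functions: length and number of upper-case letters.
feature_functions = [len, lambda v: sum(c.isupper() for c in v)]

matrix = exec_sketch(counts, feature_functions)
```

Here `matrix` has three rows (one per distinct value, in insertion order) and two columns (one per feature function).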
- class openclean.embedding.base.FeatureVector
Bases: collections.OrderedDict
Maintain a list of values (e.g., the distinct values in a data frame column) together with their feature vectors (e.g., word embeddings).
- add(value, vec)
- property data
Get a numpy array containing all feature vectors. The order of vectors is the same as the order in which they were added.
- Return type
numpy.array
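The ordering guarantee of `data` follows from the `OrderedDict` base class. A minimal sketch of that idea, assuming that `add` simply stores the vector under its value, is shown below. `FeatureVectorSketch` is a hypothetical stand-in, and a nested list replaces the numpy array to keep the example dependency-free.

```python
from collections import OrderedDict


class FeatureVectorSketch(OrderedDict):
    """Hypothetical stand-in that maps each value to its feature vector."""

    def add(self, value, vec):
        # Assumption: the vector is stored under its value. The OrderedDict
        # base class remembers the order in which keys were inserted.
        self[value] = vec

    @property
    def data(self):
        # Nested list standing in for the numpy array of the real class;
        # rows appear in insertion order.
        return [list(vec) for vec in self.values()]


fv = FeatureVectorSketch()
fv.add("Brooklyn", [8.0, 1.0])
fv.add("NYC", [3.0, 3.0])
rows = fv.data
```

Because the container is a dictionary, the vector for a single value can also be looked up directly, e.g. `fv["NYC"]`.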
- class openclean.embedding.base.ValueEmbedder
Bases: object
Abstract generator class of value (word) embeddings for scalar values. Outputs a feature vector for each value.
- abstract embed(value)
Return the embedding vector for a given scalar value.
- Parameters
value (scalar) – Scalar value (or tuple) in a data stream.
- Return type
numpy.array
- abstract prepare(values)
Pass the list of values to the vector generator to pre-compute any statistics (e.g., min-max values) that are required.
- Parameters
values (iterable) – List of data values.
- Return type
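The two-step protocol (first `prepare`, then `embed` for each value) can be sketched with a toy embedder that min-max normalizes the string length of each value. This is an illustration under stated assumptions, not code from the library: `ValueEmbedderSketch` and `NormalizedLength` are hypothetical names, and one-element lists stand in for numpy arrays.

```python
from abc import ABC, abstractmethod


class ValueEmbedderSketch(ABC):
    """Hypothetical stand-in for the abstract embedder interface."""

    @abstractmethod
    def embed(self, value):
        """Return the feature vector for a single value."""

    @abstractmethod
    def prepare(self, values):
        """Pre-compute statistics over all values before embedding."""


class NormalizedLength(ValueEmbedderSketch):
    """Embed a value as its min-max normalized string length."""

    def __init__(self):
        self.min_len = 0
        self.max_len = 0

    def prepare(self, values):
        # Collect the min-max statistics that embed() relies on.
        lengths = [len(str(v)) for v in values]
        self.min_len = min(lengths, default=0)
        self.max_len = max(lengths, default=0)

    def embed(self, value):
        span = self.max_len - self.min_len
        if span == 0:
            # All prepared values have the same length; avoid division by zero.
            return [0.0]
        return [(len(str(value)) - self.min_len) / span]


embedder = NormalizedLength()
embedder.prepare(["a", "abc", "abcde"])
vec = embedder.embed("abc")
```

Calling `embed` before `prepare` would use the default statistics, which is why the statistics step is kept separate and expected to run first.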
- openclean.embedding.base.embedding(df, columns, features)
Compute feature vectors (embeddings) for values in a given data frame column or list of columns. Computes a feature vector for each value using the given vector generator.
Returns an array of n-dimensional feature vectors, where n is the number of features. The array has one row per value in the selected data frame column(s).
- Parameters
df (pandas.DataFrame) – Input data frame
columns (int or string or list(int or string)) – Column index or column name, or a list of these, identifying the column(s) for which distinct value combinations are computed.
features (openclean.embedding.base.ValueEmbedder) – Generator for feature vectors that computes a vector of numeric values for a given scalar value (or tuple).
- Return type