openclean.embedding.feature.default module

Default feature embedding for strings.

class openclean.embedding.feature.default.StandardEmbedding

Bases: openclean.embedding.feature.base.FeatureEmbedding

Feature embedding function that computes feature vectors from a default set of seven value features. The computed features are:

  • normalized value length
  • normalized value frequency
  • uniqueness of characters in the value string
  • fraction of letter characters in the value string
  • fraction of digits in the value string
  • fraction of special characters in the value string (not digit, letter, or whitespace)
  • fraction of whitespace characters in the value string
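To make the seven features concrete, the following is a minimal plain-Python sketch and not the openclean implementation. The helper names `char_fractions` and `standard_features` are hypothetical. The normalization (dividing by the maximum length and maximum frequency in the input) and the uniqueness measure (distinct characters divided by string length) are assumptions made for illustration; the library may define them differently.

```python
from collections import Counter


def char_fractions(value):
    """Return the (letter, digit, special, whitespace) fractions of a string."""
    n = len(value)
    if n == 0:
        return (0.0, 0.0, 0.0, 0.0)
    letters = sum(c.isalpha() for c in value)
    digits = sum(c.isdigit() for c in value)
    spaces = sum(c.isspace() for c in value)
    # Special characters are whatever is not a letter, digit, or whitespace.
    special = n - letters - digits - spaces
    return (letters / n, digits / n, special / n, spaces / n)


def standard_features(values):
    """Map each distinct value to a seven-element feature tuple.

    Assumption: length and frequency are normalized by dividing by the
    maximum observed in the input list.
    """
    counts = Counter(values)
    max_len = max((len(v) for v in counts), default=0) or 1
    max_freq = max(counts.values(), default=1)
    vectors = {}
    for value, freq in counts.items():
        # Assumed uniqueness measure: distinct characters / string length.
        uniqueness = len(set(value)) / len(value) if value else 0.0
        vectors[value] = (
            len(value) / max_len,
            freq / max_freq,
            uniqueness,
        ) + char_fractions(value)
    return vectors


vectors = standard_features(["ab 1", "ab 1", "x-9"])
print(vectors["ab 1"])
```

Under these assumptions each value maps to a tuple in the order of the list above: length, frequency, uniqueness, then the four character-class fractions.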

class openclean.embedding.feature.default.UniqueSetEmbedding

Bases: openclean.embedding.feature.base.FeatureEmbedding

Feature embedding function for unique value sets. This embedding ignores value frequencies. It computes feature vectors from a set of six value features. The computed features are:

  • normalized value length
  • uniqueness of characters in the value string
  • fraction of letter characters in the value string
  • fraction of digits in the value string
  • fraction of special characters in the value string (not digit, letter, or whitespace)
  • fraction of whitespace characters in the value string
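The practical effect of ignoring frequencies is that repeating a value in the input should not change any feature vector. The sketch below illustrates this in plain Python; it is not the openclean implementation. The function name `unique_set_features` is hypothetical, and the max-length normalization and the uniqueness measure (distinct characters divided by string length) are assumptions.

```python
def unique_set_features(values):
    """Map each distinct value to a six-element feature tuple.

    Duplicates are discarded up front, so value frequency plays no role.
    """
    distinct = set(values)
    max_len = max((len(v) for v in distinct), default=0) or 1
    vectors = {}
    for value in distinct:
        n = len(value)
        if n == 0:
            vectors[value] = (0.0,) * 6
            continue
        letters = sum(c.isalpha() for c in value)
        digits = sum(c.isdigit() for c in value)
        spaces = sum(c.isspace() for c in value)
        # Special characters are whatever is not a letter, digit, or whitespace.
        special = n - letters - digits - spaces
        vectors[value] = (
            n / max_len,            # assumed normalization: divide by max length
            len(set(value)) / n,    # assumed uniqueness measure
            letters / n,
            digits / n,
            special / n,
            spaces / n,
        )
    return vectors


once = unique_set_features(["aa1", "b b"])
repeated = unique_set_features(["aa1", "aa1", "aa1", "b b"])
print(once == repeated)
```

Both calls produce the same mapping because the duplicated "aa1" entries collapse into a single set member before any feature is computed.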