openclean.profiling.pattern.base module

Abstract base class for pattern discovery operators.

class openclean.profiling.pattern.base.Pattern

Bases: object

Interface for objects that represent a pattern, e.g., a regular expression, discovered by a pattern finder. Implementations maintain a representation of the pattern itself as well as any additional metadata generated during the discovery process.

abstract compile(negate=False, generator=None)

Get a value function that acts as a predicate for testing whether a given value is accepted by the pattern.

Parameters
  • negate (bool, default=False) – If True, the returned predicate returns True for values that are not accepted by the pattern and False for those that are accepted.

  • generator (PatternFinder (optional)) – The pattern finder that generated the original pattern. Needed to recreate the tokenization and type detection on new values.

Return type

openclean.function.value.base.ValueFunction
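The contract of compile() and its negate flag can be illustrated with a minimal sketch. RegexPattern below is a hypothetical stand-in, not part of openclean: it does not subclass Pattern, it returns a plain callable instead of a ValueFunction, and it ignores the generator argument. A real implementation would likely differ on all three points.

```python
import re


class RegexPattern:
    """Hypothetical stand-in for a Pattern implementation (not openclean code)."""

    def __init__(self, expression):
        self.expression = expression
        self._regex = re.compile(expression)

    def compile(self, negate=False, generator=None):
        # Assumption: a plain callable stands in for a ValueFunction here.
        # The generator argument is accepted but unused in this sketch.
        def predicate(value):
            accepted = self._regex.fullmatch(str(value)) is not None
            # With negate=True the result is inverted: True for rejected values.
            return not accepted if negate else accepted

        return predicate


zip_code = RegexPattern(r"\d{5}")
is_zip = zip_code.compile()
is_not_zip = zip_code.compile(negate=True)
```

Here is_zip accepts only values that fully match the expression, while is_not_zip returns the opposite answer for every value.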

abstract metadata()

Return a dictionary containing optional metadata associated with the pattern. This can be, for example, statistics generated by the pattern discovery algorithm that provide evidence for the algorithm's confidence in the pattern or for the pattern's relevance.

The structure of the dictionary is implementation-dependent. If no additional metadata was generated, an empty dictionary should be returned.

Return type

dict

abstract pattern()

Get a string representation of the pattern for display purposes.

Return type

string

abstract to_dict()

Return a dictionary serialization of the pattern. This external representation is used when the results of a pattern finder are included in the output of a data profiler.

Return type

dict
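How pattern(), metadata(), and to_dict() relate to one another can be sketched as follows. CountedPattern is a hypothetical class that only mirrors the method names of the interface. The dictionary keys ("matches", "total", "pattern", "metadata") are assumptions chosen for illustration, since the actual structure is implementation-dependent.

```python
class CountedPattern:
    """Hypothetical stand-in for a Pattern implementation (not openclean code)."""

    def __init__(self, expression, matches=None, total=None):
        self.expression = expression
        self.matches = matches
        self.total = total

    def pattern(self):
        # String representation for display purposes.
        return self.expression

    def metadata(self):
        # Return an empty dictionary if no statistics were collected.
        if self.matches is None or self.total is None:
            return {}
        return {"matches": self.matches, "total": self.total}

    def to_dict(self):
        # Assumed serialization layout; real implementations may differ.
        return {"pattern": self.pattern(), "metadata": self.metadata()}


with_stats = CountedPattern(r"\d{5}", matches=98, total=100)
without_stats = CountedPattern(r"\d{5}")
```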

class openclean.profiling.pattern.base.PatternFinder

Bases: openclean.profiling.base.DistinctSetProfiler

Interface for generic regular expression discovery. Each implementation should take an iterable of (distinct) values (e.g., from a column in a data frame) as its input. The result is one or more strings, each representing a regular expression.

exec(values)

This method is executed when the pattern finder is used as part of a data profiler. It returns a list with dictionary serializations for the patterns that are discovered by the find method.

Parameters

values (list) – List of scalar values or tuples of scalar values.

Return type

list

abstract find(values)

Discover patterns, such as regular expressions, in a given sequence of (distinct) values. Returns a list of objects representing the discovered patterns.

Parameters

values (list) – List of scalar values or tuples of scalar values.

Return type

list(openclean.profiling.pattern.base.Pattern)
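The relationship between find() and exec() can be sketched with a toy finder. ShapeFinder and ShapePattern are hypothetical classes that do not subclass the openclean base classes, and the character-shape heuristic is an assumption made for illustration, not an algorithm taken from openclean. The sketch only shows that exec() returns the dictionary serializations of whatever find() discovers.

```python
import re
import string
from collections import Counter


class ShapePattern:
    """Hypothetical pattern object holding an expression and a value count."""

    def __init__(self, expression, count):
        self.expression = expression
        self.count = count

    def to_dict(self):
        # Assumed serialization layout for this sketch.
        return {"pattern": self.expression, "count": self.count}


class ShapeFinder:
    """Hypothetical finder that groups values by their character shape."""

    def find(self, values):
        # Count how many values share each shape, most frequent first.
        shapes = Counter(self._shape(v) for v in values)
        return [ShapePattern(expr, n) for expr, n in shapes.most_common()]

    def exec(self, values):
        # Serialize the patterns discovered by find().
        return [p.to_dict() for p in self.find(values)]

    @staticmethod
    def _shape(value):
        # Map ASCII digits to \d, ASCII letters to [a-zA-Z], escape the rest.
        parts = []
        for ch in str(value):
            if ch in string.digits:
                parts.append(r"\d")
            elif ch in string.ascii_letters:
                parts.append("[a-zA-Z]")
            else:
                parts.append(re.escape(ch))
        return "".join(parts)


profile = ShapeFinder().exec(["10003", "11201", "AB12"])
```

Each entry in profile is a plain dictionary, which is what allows the result to be embedded in the output of a data profiler.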