openclean.function.token.base module

Interfaces for string tokenizers and token set transformers.

class openclean.function.token.base.CapitalizeTokens

Bases: openclean.function.token.base.UpdateTokens

Capitalize all tokens in a given list.

class openclean.function.token.base.LowerTokens

Bases: openclean.function.token.base.UpdateTokens

Convert all tokens in a given list to lower case.

class openclean.function.token.base.ReverseTokens

Bases: openclean.function.token.base.TokenTransformer

Reverse a given list of string tokens.

transform(tokens: List[openclean.function.token.base.Token]) List[openclean.function.token.base.Token]

Return a reversed copy of the token list.

Parameters

tokens (list of openclean.function.token.base.Token) – List of string tokens.

Return type

list of openclean.function.token.base.Token

class openclean.function.token.base.SortTokens(key: Optional[Callable] = None, reverse: Optional[bool] = False)

Bases: openclean.function.token.base.TokenTransformer

Sort a given token list in ascending or descending order.

transform(tokens: List[openclean.function.token.base.Token]) List[openclean.function.token.base.Token]

Returns a sorted copy of the token list.

Parameters

tokens (list of openclean.function.token.base.Token) – List of string tokens.

Return type

list of openclean.function.token.base.Token
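The sort behavior with the constructor's key and reverse parameters can be sketched as follows (an illustrative stand-in built on Python's sorted, not the openclean source):

```python
# Sketch of SortTokens.transform: sort a token list, optionally using
# a key function and/or descending order (illustrative only).

def sort_tokens(tokens, key=None, reverse=False):
    # sorted() returns a new list and leaves the input unchanged.
    return sorted(tokens, key=key, reverse=reverse)

print(sort_tokens(['bb', 'a', 'ccc'], key=len, reverse=True))  # ['ccc', 'bb', 'a']
```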

class openclean.function.token.base.StandardizeTokens(mapping: Union[Dict, openclean.function.value.mapping.Standardize])

Bases: openclean.function.token.base.UpdateTokens

Standardize tokens in a given list using a standardization mapping.
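A minimal sketch of mapping-based standardization (the function name is illustrative; the actual class also accepts a Standardize value function):

```python
# Sketch of StandardizeTokens: replace each token that has an entry in
# the mapping; tokens without an entry are kept as-is.

def standardize_tokens(tokens, mapping):
    return [mapping.get(t, t) for t in tokens]

tokens = ['st', 'marks', 'ave']
mapping = {'ave': 'AVENUE', 'st': 'STREET'}
print(standardize_tokens(tokens, mapping))  # ['STREET', 'marks', 'AVENUE']
```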

class openclean.function.token.base.Token(value: str, token_type: Optional[str] = None, rowidx: Optional[int] = None)

Bases: str

Tokens are strings that have an optional (semantic) type label.

The values for type labels are not constrained. It is good practice to use all upper-case values for token types. The default token type is ‘ANY’.

This implementation is based on: https://bytes.com/topic/python/answers/32098-my-experiences-subclassing-string

During object creation, the __new__ method is called first and returns the object; __init__ is called afterwards.

property regex_type: str

Synonym for getting the token type.

Return type

str

property size: int

Synonym to get the length of the token.

Return type

int

to_tuple() Tuple[str, str, int]

Returns a tuple of the string, type and value size.

Return type

tuple of string, string, int

type() str

Get token type value.

This is a wrapper around the token_type property. Returns the default token type ‘ANY’ if no type was given when the object was created.

Return type

string

property value: str

Get the value for this token.

Return type

str
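The str-subclassing pattern referenced above can be sketched as follows (a simplified stand-in for the Token class, following the __new__/__init__ creation order described; not the openclean source):

```python
# Sketch of a str subclass with an optional type label. Because str is
# immutable, the string value must be set in __new__; __init__ then
# runs on the already-created object and can attach extra attributes.

class Token(str):
    def __new__(cls, value, token_type=None, rowidx=None):
        return super().__new__(cls, value)

    def __init__(self, value, token_type=None, rowidx=None):
        self.token_type = token_type
        self.rowidx = rowidx

    def type(self):
        # Default token type is 'ANY' if none was given at creation.
        return self.token_type if self.token_type is not None else 'ANY'

t = Token('42', token_type='DIGIT')
print(t.upper(), t.type(), len(t))  # behaves like a str, plus a type label
```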

class openclean.function.token.base.TokenPrefix(length: int)

Bases: openclean.function.token.base.TokenTransformer

Return a prefix of a given token list with a maximal length of N (where N is a user-defined parameter). Input lists that have fewer than N elements are returned as is.

transform(tokens: List[openclean.function.token.base.Token]) List[openclean.function.token.base.Token]

Return a list containing the first N elements of the input list, where N is the length parameter defined during initialization. If the input list has no more than N elements, it is returned as is.

Parameters

tokens (list of openclean.function.token.base.Token) – List of string tokens.

Return type

list of openclean.function.token.base.Token
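The prefix behavior reduces to a list slice, which already returns shorter inputs unchanged (an illustrative sketch, not the openclean source):

```python
# Sketch of TokenPrefix.transform: keep at most the first N tokens.

def token_prefix(tokens, length):
    # Slicing never raises for short inputs; it simply returns a copy
    # of the whole list when len(tokens) <= length.
    return tokens[:length]

print(token_prefix(['a', 'b', 'c', 'd'], 2))  # ['a', 'b']
print(token_prefix(['a', 'b'], 5))            # ['a', 'b']
```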

class openclean.function.token.base.TokenTransformer

Bases: object

The token transformer manipulates a list of string tokens. Manipulations may include removing tokens from the input list, rearranging tokens, or adding new tokens to the list. The interface defines a single transform method that takes a list of tokens as input and returns a (modified) list of tokens.

abstract transform(tokens: List[openclean.function.token.base.Token]) List[openclean.function.token.base.Token]

Transform a list of string tokens. Returns a modified copy of the input list of tokens.

Parameters

tokens (list of openclean.function.token.base.Token) – List of string tokens.

Return type

list of openclean.function.token.base.Token
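Implementations subclass the interface and override transform. The following sketch mirrors the interface with a hypothetical transformer that removes empty tokens (the RemoveEmptyTokens class is illustrative and not part of openclean):

```python
from abc import ABC, abstractmethod
from typing import List

# Sketch of the transformer interface and a custom implementation.

class TokenTransformer(ABC):
    @abstractmethod
    def transform(self, tokens: List[str]) -> List[str]:
        """Return a modified copy of the input token list."""
        raise NotImplementedError()

class RemoveEmptyTokens(TokenTransformer):
    def transform(self, tokens):
        # Keep only non-empty tokens; the input list is not mutated.
        return [t for t in tokens if t]

print(RemoveEmptyTokens().transform(['a', '', 'b']))  # ['a', 'b']
```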

class openclean.function.token.base.TokenTransformerPipeline(transformers: List[openclean.function.token.base.TokenTransformer])

Bases: openclean.function.token.base.TokenTransformer

Sequence of token transformers that are applied on a given input list of string tokens.

transform(tokens: List[openclean.function.token.base.Token]) List[openclean.function.token.base.Token]

Transform a list of string tokens. Applies the transformers in the pipeline sequentially, each operating on the output of its predecessor in the pipeline.

Parameters

tokens (list of openclean.function.token.base.Token) – List of string tokens.

Return type

list of openclean.function.token.base.Token
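The sequential application can be sketched with plain callables standing in for transformer objects (illustrative only, not the openclean source):

```python
# Sketch of TokenTransformerPipeline.transform: feed the token list
# through each step, where every step consumes the previous output.

def pipeline_transform(transformers, tokens):
    for f in transformers:
        tokens = f(tokens)
    return tokens

steps = [
    lambda ts: [t.lower() for t in ts],  # first lower-case every token
    sorted,                              # then sort the result
]
print(pipeline_transform(steps, ['B', 'a', 'C']))  # ['a', 'b', 'c']
```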

class openclean.function.token.base.Tokenizer

Bases: object

Interface for string tokenizers. A string tokenizer should be able to handle any scalar value (e.g., by first transforming numeric values into a string representation). The tokenizer returns a list of token objects.

encode(values: List[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]]) List[List[openclean.function.token.base.Token]]

Encodes all values in a given column (i.e., list of values) into their type representations and tokenizes each value.

Parameters

values (list of scalar) – List of column values

Return type

list of list of openclean.function.token.base.Token

abstract tokens(value: Union[int, float, str, datetime.datetime], rowidx: Optional[int] = None) List[openclean.function.token.base.Token]

Convert a given scalar value into a list of string tokens. If a value cannot be converted into tokens, None should be returned.

The order of tokens in the returned list does not necessarily correspond to their order in the original value; this is implementation-dependent.

Parameters
  • value (scalar) – Value that is converted into a list of tokens.

  • rowidx (int, default=None) – Optional index of the dataset row that the value originates from.

Return type

list of openclean.function.token.base.Token
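A hypothetical tokenizer following this interface could split a scalar value on whitespace, converting non-string scalars first (this function is illustrative and not part of openclean):

```python
# Sketch of the tokens() contract: accept any scalar, convert it to a
# string, and return a list of tokens.

def tokens(value, rowidx=None):
    # Non-string scalars (int, float, datetime) are stringified first.
    return str(value).split()

print(tokens(3.14))       # ['3.14']
print(tokens('a b  c'))   # ['a', 'b', 'c']
```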

class openclean.function.token.base.Tokens(tokenizer: openclean.function.token.base.Tokenizer, transformer: Optional[Union[List[openclean.function.token.base.TokenTransformer], openclean.function.token.base.TokenTransformer]] = None, delim: Optional[str] = '', sort: Optional[bool] = False, reverse: Optional[bool] = False, unique: Optional[bool] = False)

Bases: openclean.function.value.base.PreparedFunction, openclean.function.token.base.Tokenizer

The default tokenizer is a simple wrapper around a given tokenizer and an (optional) token transformer that is applied on the output of the given tokenizer.

This class provides the functionality to easily add default transformations to the generated token lists.

The default tokenizer also extends the ValueFunction class to provide functionality for concatenating the generated token list into a token key string.

eval(value: Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]) str

Tokenize a given value and return a concatenated string of the resulting tokens.

Parameters

value (scalar or tuple) – Input value that is tokenized and concatenated.

Return type

string

tokens(value: Union[int, float, str, datetime.datetime], rowidx: Optional[int] = None) List[openclean.function.token.base.Token]

Tokenize the given value using the associated tokenizer. Then modify the tokens with the optional token transformer.

Parameters
  • value (scalar) – Value that is converted into a list of tokens.

  • rowidx (int, default=None) – Optional index of the dataset row that the value originates from.

Return type

list of openclean.function.token.base.Token
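The eval() behavior of the wrapper can be sketched as: tokenize the value, apply the optional transformers, then join the tokens with the delimiter into a key string (the helper below is illustrative; parameter names mirror the class signature, but the logic is an assumption, not the openclean source):

```python
# Sketch of Tokens.eval: tokenize, transform, concatenate into a key.

def eval_key(value, tokenizer, transformers=(), delim=''):
    tokens = tokenizer(str(value))
    for f in transformers:
        tokens = f(tokens)
    return delim.join(tokens)

key = eval_key(
    '5th Ave  New York',
    tokenizer=str.split,
    transformers=[lambda ts: [t.lower() for t in ts], sorted],
    delim=' ',
)
print(key)  # '5th ave new york'
```

Sorting the tokens before joining makes the key order-insensitive, which is useful when the key is used to match values that list the same parts in different orders.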

class openclean.function.token.base.UniqueTokens

Bases: openclean.function.token.base.TokenTransformer

Remove duplicate tokens to return a list of unique tokens.

transform(tokens: List[openclean.function.token.base.Token]) List[openclean.function.token.base.Token]

Returns a list of unique tokens from the input list.

Parameters

tokens (list of openclean.function.token.base.Token) – List of string tokens.

Return type

list of openclean.function.token.base.Token

class openclean.function.token.base.UpdateTokens(func: Union[Callable, openclean.function.value.base.ValueFunction])

Bases: openclean.function.token.base.TokenTransformer

Update tokens by applying a value function to each of them.

transform(tokens: List[openclean.function.token.base.Token]) List[openclean.function.token.base.Token]

Returns the list of tokens that results from applying the associated value function to each of the tokens in the input list.

Parameters

tokens (list of openclean.function.token.base.Token) – List of string tokens.

Return type

list of openclean.function.token.base.Token
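The update pattern is a per-token map; CapitalizeTokens, LowerTokens, and UpperTokens follow it with str.capitalize, str.lower, and str.upper as the value function (the helper below is an illustrative sketch, not the openclean source):

```python
# Sketch of UpdateTokens.transform: apply a value function to every token.

def update_tokens(tokens, func):
    return [func(t) for t in tokens]

print(update_tokens(['new', 'york'], str.capitalize))  # ['New', 'York']
print(update_tokens(['New', 'York'], str.upper))       # ['NEW', 'YORK']
```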

class openclean.function.token.base.UpperTokens

Bases: openclean.function.token.base.UpdateTokens

Convert all tokens in a given list to upper case.