openclean.function.token.base module
Interfaces for string tokenizers and token set transformers.
- class openclean.function.token.base.CapitalizeTokens
Bases:
openclean.function.token.base.UpdateTokens
Capitalize all tokens in a given list.
- class openclean.function.token.base.LowerTokens
Bases:
openclean.function.token.base.UpdateTokens
Convert all tokens in a given list to lower case.
- class openclean.function.token.base.ReverseTokens
Bases:
openclean.function.token.base.TokenTransformer
Reverse a given list of string tokens.
- transform(tokens: List[openclean.function.token.base.Token]) List[openclean.function.token.base.Token]
Return a reversed copy of the token list.
- Parameters
tokens (list of openclean.function.token.base.Token) – List of string tokens.
- Return type
list of openclean.function.token.base.Token
- class openclean.function.token.base.SortTokens(key: Optional[Callable] = None, reverse: Optional[bool] = False)
Bases:
openclean.function.token.base.TokenTransformer
Sort a given token list in ascending or descending order.
- transform(tokens: List[openclean.function.token.base.Token]) List[openclean.function.token.base.Token]
Returns a sorted copy of the token list.
- Parameters
tokens (list of openclean.function.token.base.Token) – List of string tokens.
- Return type
list of openclean.function.token.base.Token
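The sorting behavior described above can be sketched with plain strings. This is a minimal illustration and not the library's implementation: the function name `sort_tokens` is hypothetical, and it assumes that `key` and `reverse` carry the same meaning as in Python's built-in `sorted()`.

```python
from typing import Callable, List, Optional


def sort_tokens(
    tokens: List[str],
    key: Optional[Callable] = None,
    reverse: bool = False,
) -> List[str]:
    """Return a sorted copy of the tokens (hypothetical helper)."""
    # sorted() always builds a new list, so the caller's list is not mutated.
    return sorted(tokens, key=key, reverse=reverse)


tokens = ["pear", "Apple", "fig"]
# Default ordering compares strings by code point (upper case sorts first).
ascending = sort_tokens(tokens)
# A key function changes the ordering criterion; here, longest token first.
by_length_desc = sort_tokens(tokens, key=len, reverse=True)
```

Returning a copy keeps the transformer free of side effects, which matters when the same token list is passed to more than one transformer.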
- class openclean.function.token.base.StandardizeTokens(mapping: Union[Dict, openclean.function.value.mapping.Standardize])
Bases:
openclean.function.token.base.UpdateTokens
Standardize tokens in a given list using a standardization mapping.
- class openclean.function.token.base.Token(value: str, token_type: Optional[str] = None, rowidx: Optional[int] = None)
Bases:
str
Tokens are strings that have an optional (semantic) type label.
The values for type labels are not constrained. It is good practice to use all upper case values for token types. The default token type is ‘ANY’.
This implementation is based on: https://bytes.com/topic/python/answers/32098-my-experiences-subclassing-string
When an object is created, the __new__ method is called first and returns the object; __init__ is then called on that object.
- property regex_type: str
Synonym for getting the token type.
- Return type
str
- property size: int
Synonym to get the length of the token.
- Return type
int
- to_tuple() Tuple[str, str, int]
Returns a tuple of the string, type and value size.
- Return type
tuple of string, string, int
- type() str
Get token type value.
This is a wrapper around the token_type property. Returns the default token type ‘ANY’ if no type was given when the object was created.
- Return type
string
- property value: str
Get the value for this token.
- Return type
str
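The subclassing pattern behind a typed string token can be sketched as follows. The class name `TypedToken` is hypothetical and the code is a simplified stand-in, not the library's `Token` class: it omits `rowidx` and the properties, and only illustrates the `__new__` then `__init__` order together with the ‘ANY’ default.

```python
from typing import Optional, Tuple


class TypedToken(str):
    """Hypothetical string subclass that carries an optional type label."""

    def __new__(cls, value: str, token_type: Optional[str] = None):
        # str is immutable, so the text must be fixed when the object is built.
        return super().__new__(cls, value)

    def __init__(self, value: str, token_type: Optional[str] = None):
        # Runs after __new__ on the object that __new__ returned.
        self.token_type = token_type

    def type(self) -> str:
        # Fall back to the default label if no type was given.
        return self.token_type if self.token_type is not None else "ANY"

    def to_tuple(self) -> Tuple[str, str, int]:
        # (string value, type label, length of the string)
        return (str(self), self.type(), len(self))


typed = TypedToken("42", "NUMERIC")
untyped = TypedToken("abc")
```

Because the class derives from `str`, instances can be used wherever a plain string is expected (comparison, slicing, concatenation), while the type label travels along as an extra attribute.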
- class openclean.function.token.base.TokenPrefix(length: int)
Bases:
openclean.function.token.base.TokenTransformer
Return a list that is a prefix of a given list. The returned list is a prefix of the input list with a maximal length of N (where N is a user-defined parameter). Input lists that have fewer elements than N are returned as is.
- transform(tokens: List[openclean.function.token.base.Token]) List[openclean.function.token.base.Token]
Return a list that contains the first N elements of the input list, where N is the length parameter defined during initialization. If the input list has at most N elements, it is returned as is.
- Parameters
tokens (list of openclean.function.token.base.Token) – List of string tokens.
- Return type
list of openclean.function.token.base.Token
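The prefix rule can be illustrated with a small stand-alone function. The name `token_prefix` is hypothetical, and the sketch works on plain strings rather than the library's classes.

```python
from typing import List


def token_prefix(tokens: List[str], length: int) -> List[str]:
    """Return the first `length` tokens (hypothetical helper)."""
    if len(tokens) <= length:
        # Short inputs are handed back unchanged, as described above.
        return tokens
    # Slicing creates a new list that holds only the leading elements.
    return tokens[:length]


short_input = ["a", "b"]
prefix = token_prefix(["a", "b", "c", "d"], 3)
unchanged = token_prefix(short_input, 3)
```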
- class openclean.function.token.base.TokenTransformer
Bases:
object
The token transformer manipulates a list of string tokens. Manipulations may include removing tokens from an input list, rearranging tokens or even adding new tokens to the list. Defines a single transform method that takes a list of strings as input and returns a (modified) list of strings.
- abstract transform(tokens: List[openclean.function.token.base.Token]) List[openclean.function.token.base.Token]
Transform a list of string tokens. Returns a modified copy of the input list of tokens.
- Parameters
tokens (list of openclean.function.token.base.Token) – List of string tokens.
- Return type
list of openclean.function.token.base.Token
- class openclean.function.token.base.TokenTransformerPipeline(transformers: List[openclean.function.token.base.TokenTransformer])
Bases:
openclean.function.token.base.TokenTransformer
Sequence of token transformers that are applied on a given input list of string tokens.
- transform(tokens: List[openclean.function.token.base.Token]) List[openclean.function.token.base.Token]
Transform a list of string tokens. Applies the transformers in the pipeline sequentially, each on the output of its predecessor in the pipeline.
- Parameters
tokens (list of openclean.function.token.base.Token) – List of string tokens.
- Return type
list of openclean.function.token.base.Token
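Sequential application of transformers can be sketched as a simple loop. The sketch below models each transformer as a plain function from a token list to a token list; `apply_pipeline` and the two example steps are hypothetical names and not part of the library.

```python
from typing import Callable, List

# Assumption for this sketch: a transformer is any callable list -> list.
Transformer = Callable[[List[str]], List[str]]


def apply_pipeline(tokens: List[str], transformers: List[Transformer]) -> List[str]:
    """Feed the output of each step into the next one (hypothetical helper)."""
    for transform in transformers:
        tokens = transform(tokens)
    return tokens


def lower(tokens: List[str]) -> List[str]:
    return [t.lower() for t in tokens]


def unique_sorted(tokens: List[str]) -> List[str]:
    return sorted(set(tokens))


lower_first = apply_pipeline(["B", "a", "b"], [lower, unique_sorted])
unique_first = apply_pipeline(["B", "a", "b"], [unique_sorted, lower])
```

The two calls differ only in the order of the steps and yield different results, which shows why the position of a transformer in the pipeline matters.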
- class openclean.function.token.base.Tokenizer
Bases:
object
Interface for string tokenizers. A string tokenizer should be able to handle any scalar value (e.g., by first transforming numeric values into a string representation). The tokenizer returns a list of token objects.
- encode(values: List[Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]]) List[List[openclean.function.token.base.Token]]
Encodes all values in a given column (i.e., list of values) into their type representations and tokenizes each value.
- Parameters
values (list of scalar) – List of column values
- Return type
list of list of openclean.function.token.base.Token
- abstract tokens(value: Union[int, float, str, datetime.datetime], rowidx: Optional[int] = None) List[openclean.function.token.base.Token]
Convert a given scalar value into a list of string tokens. If a given value cannot be converted into tokens, None should be returned.
The order of tokens in the returned list does not necessarily correspond to their order in the original value. This is implementation dependent.
- Parameters
value (scalar) – Value that is converted into a list of tokens.
rowidx (int, default=None) – Optional index of the dataset row that the value originates from.
- Return type
list of openclean.function.token.base.Token
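The relationship between the abstract `tokens` method and `encode` can be sketched with a stand-alone abstract base class. The names `SketchTokenizer` and `WhitespaceTokenizer` are hypothetical. The sketch returns plain strings instead of Token objects, and it assumes that `encode` passes the list position of each value as `rowidx`, which the text above does not state.

```python
from abc import ABC, abstractmethod
from typing import Any, List, Optional


class SketchTokenizer(ABC):
    """Hypothetical stand-in for a tokenizer interface."""

    @abstractmethod
    def tokens(self, value: Any, rowidx: Optional[int] = None) -> List[str]:
        """Convert a single scalar value into a list of tokens."""

    def encode(self, values: List[Any]) -> List[List[str]]:
        # Assumption: the row index is the position of the value in the list.
        return [self.tokens(v, rowidx=i) for i, v in enumerate(values)]


class WhitespaceTokenizer(SketchTokenizer):
    """Split the string representation of a value on whitespace."""

    def tokens(self, value: Any, rowidx: Optional[int] = None) -> List[str]:
        # Non-string scalars are converted to strings before splitting.
        return str(value).split()


encoded = WhitespaceTokenizer().encode(["5th Ave", 42, 3.5])
```

Only `tokens` has to be implemented by a concrete tokenizer in this sketch; `encode` is derived from it.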
- class openclean.function.token.base.Tokens(tokenizer: openclean.function.token.base.Tokenizer, transformer: Optional[Union[List[openclean.function.token.base.TokenTransformer], openclean.function.token.base.TokenTransformer]] = None, delim: Optional[str] = '', sort: Optional[bool] = False, reverse: Optional[bool] = False, unique: Optional[bool] = False)
Bases:
openclean.function.value.base.PreparedFunction, openclean.function.token.base.Tokenizer
The default tokenizer is a simple wrapper around a given tokenizer and an (optional) token transformer that is applied on the output of the given tokenizer.
This class provides functionality to easily add default transformations to the generated token lists.
The default tokenizer also extends the ValueFunction class to provide functionality to concatenate the generated token list to a token key string.
- eval(value: Union[int, float, str, datetime.datetime, Tuple[Union[int, float, str, datetime.datetime]]]) str
Tokenize a given value and return a concatenated string of the resulting tokens.
- Parameters
value (scalar or tuple) – Input value that is tokenized and concatenated.
- Return type
string
- tokens(value: Union[int, float, str, datetime.datetime], rowidx: Optional[int] = None) List[openclean.function.token.base.Token]
Tokenize the given value using the associated tokenizer. Then modify the tokens with the optional token transformer.
- Parameters
value (scalar) – Value that is converted into a list of tokens.
rowidx (int, default=None) – Optional index of the dataset row that the value originates from.
- Return type
list of openclean.function.token.base.Token
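The combination of tokenizing, transforming, and concatenating into a key string can be sketched as one function. The name `token_key` and its parameters are hypothetical; the tokenizer and transformer are modelled as plain callables, and the sketch does not reproduce the `sort`, `reverse`, or `unique` options of the class.

```python
from typing import Callable, List, Optional


def token_key(
    value: str,
    tokenize: Callable[[str], List[str]],
    transform: Optional[Callable[[List[str]], List[str]]] = None,
    delim: str = "",
) -> str:
    """Tokenize, optionally transform, then join into one string."""
    tokens = tokenize(value)
    if transform is not None:
        # The transformer works on the tokenizer's output.
        tokens = transform(tokens)
    return delim.join(tokens)


def unique_sorted(tokens: List[str]) -> List[str]:
    return sorted(set(tokens))


plain_key = token_key("Main St Main", str.split)
normalized_key = token_key("Main St Main", str.split, unique_sorted, delim=" ")
```

A normalized key of this kind maps values that differ only in token order or repetition to the same string.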
- class openclean.function.token.base.UniqueTokens
Bases:
openclean.function.token.base.TokenTransformer
Remove duplicate tokens to return a list of unique tokens.
- transform(tokens: List[openclean.function.token.base.Token]) List[openclean.function.token.base.Token]
Returns a list of unique tokens from the input list.
- Parameters
tokens (list of openclean.function.token.base.Token) – List of string tokens.
- Return type
list of openclean.function.token.base.Token
- class openclean.function.token.base.UpdateTokens(func: Union[Callable, openclean.function.value.base.ValueFunction])
Bases:
openclean.function.token.base.TokenTransformer
Update tokens by applying a value function to each of them.
- transform(tokens: List[openclean.function.token.base.Token]) List[openclean.function.token.base.Token]
Returns the list of tokens that results from applying the associated value function to each of the tokens in the input list.
- Parameters
tokens (list of openclean.function.token.base.Token) – List of string tokens.
- Return type
list of openclean.function.token.base.Token
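Applying a function to every token can be sketched with a list comprehension. The name `update_tokens` is hypothetical and the sketch works on plain strings. It also assumes that the case-changing transformers in this module behave like Python's `str.capitalize`, `str.lower`, and `str.upper`, which the text above does not confirm.

```python
from typing import Callable, List


def update_tokens(tokens: List[str], func: Callable[[str], str]) -> List[str]:
    """Apply `func` to each token and collect the results."""
    return [func(t) for t in tokens]


tokens = ["new", "YORK"]
capitalized = update_tokens(tokens, str.capitalize)
lowered = update_tokens(tokens, str.lower)
uppered = update_tokens(tokens, str.upper)
```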
- class openclean.function.token.base.UpperTokens
Bases:
openclean.function.token.base.UpdateTokens
Convert all tokens in a given list to upper case.