openclean.function.token.split module

String tokenizer that is a wrapper around the string split method.

class openclean.function.token.split.ChartypeSplit(chartypes: Optional[List[Tuple[Callable, str]]] = None)

Bases: openclean.function.token.base.Tokenizer

Split values based on a list of character type functions. That is, a value that contains characters of different types, e.g., W35ST, is split into tokens of homogeneous character type, e.g., ['W', '35', 'ST'].

The type of a character is determined by a classifier that is given as a list of Boolean predicates, i.e., callables that accept a single character and return True if the character belongs to the type that the function represents, or False otherwise. Each classifier is associated with a token type label that is assigned to the generated tokens. If a token does not match any of the given classifiers, the default token type is assigned.
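The splitting logic described above can be sketched in plain Python. This is an illustrative standalone sketch, not the library implementation; the (predicate, label) pairs mirror the chartypes constructor argument, and the labels 'A' and 'D' are arbitrary examples rather than library defaults.

```python
def chartype_split(value, chartypes):
    """Split value into runs of characters that share a type label."""
    tokens = []
    current_type = None
    for c in value:
        # Label of the first matching predicate, or None if none match.
        ctype = next((label for pred, label in chartypes if pred(c)), None)
        if tokens and ctype == current_type:
            tokens[-1] += c  # extend the current homogeneous run
        else:
            tokens.append(c)  # start a new token
            current_type = ctype
    return tokens


# Illustrative classifier list: letters vs. digits.
chartypes = [(str.isalpha, 'A'), (str.isdigit, 'D')]
print(chartype_split('W35ST', chartypes))  # ['W', '35', 'ST']
```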

get_type(c: str) str

The type of a character is the label that is associated with the first type predicate that returns True.

If no predicate evaluates to True for a given value None is returned.

Parameters

c (string) – Expects a single character string.

Return type

string
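The first-match lookup described above amounts to a linear scan over the classifier list; a minimal sketch, assuming the same hypothetical (predicate, label) pairs as in the constructor:

```python
def get_type(c, chartypes):
    """Return the label of the first predicate that accepts c, else None."""
    for predicate, label in chartypes:
        if predicate(c):
            return label
    return None  # no predicate evaluated to True


chartypes = [(str.isalpha, 'A'), (str.isdigit, 'D')]
print(get_type('W', chartypes))  # 'A'
print(get_type('#', chartypes))  # None
```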

tokens(value: Union[int, float, str, datetime.datetime], rowidx: Optional[int] = None) List[openclean.function.token.base.Token]

Convert a given scalar value into a list of string tokens. If a given value cannot be converted into tokens, None should be returned.

The order of tokens in the returned list does not necessarily correspond to their order in the original value. This is implementation-dependent.

Parameters
  • value (scalar) – Value that is converted into a list of tokens.

  • rowidx (int, default=None) – Optional index of the dataset row that the value originates from.

Return type

list of openclean.function.token.base.Token

class openclean.function.token.split.Split(pattern: str, sort: Optional[bool] = False, reverse: Optional[bool] = False, unique: Optional[bool] = False, preproc: Optional[Callable] = None, subtokens: Optional[openclean.function.token.base.Tokenizer] = None)

Bases: openclean.function.token.base.Tokenizer

String tokenizer that is a wrapper around the regular expression split method. Defines extra parameters to (a) pre-process a given value and (b) modify the generated token lists.

The split operator allows tokens generated by the standard split function to be split further using a nested string tokenizer.
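A minimal sketch of the split-and-postprocess behavior described above, using re.split from the standard library. The parameter names mirror the constructor, but this is not the library implementation; the subtokens parameter (nested tokenizer) is omitted for brevity.

```python
import re


def split(value, pattern, sort=False, reverse=False, unique=False,
          preproc=None):
    """Split value on a regular expression, then optionally dedupe and sort."""
    if preproc is not None:
        value = preproc(value)  # optional pre-processing of the raw value
    # Drop empty strings that re.split may produce at boundaries.
    tokens = [t for t in re.split(pattern, value) if t]
    if unique:
        tokens = list(dict.fromkeys(tokens))  # keep first occurrence only
    if sort:
        tokens = sorted(tokens, reverse=reverse)
    return tokens


print(split('W-35-ST', '-'))                 # ['W', '35', 'ST']
print(split('b a c a', r'\s+', unique=True, sort=True))  # ['a', 'b', 'c']
```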

tokens(value: Union[int, float, str, datetime.datetime], rowidx: Optional[int] = None) List[openclean.function.token.base.Token]

Convert a given scalar value into a list of string tokens. If a given value cannot be converted into tokens, None should be returned.

The order of tokens in the returned list does not necessarily correspond to their order in the original value. This is implementation-dependent.

Parameters
  • value (scalar) – Value that is converted into a list of tokens.

  • rowidx (int, default=None) – Optional index of the dataset row that the value originates from.

Return type

list of openclean.function.token.base.Token