openclean.function.token.ngram module

String tokenizer that returns a list of n-grams. An n-gram in this case is a substring of length n.

class openclean.function.token.ngram.NGrams(n: int, pleft: Optional[str] = None, pright: Optional[str] = None)

Bases: openclean.function.token.base.Tokenizer

Split values into lists of n-grams, where an n-gram is a substring of length n. Strings can optionally be padded with special characters on the left and right before the n-grams are computed. That is, if a left (right) padding character is given (e.g. $), a string of n-1 padding characters is added to the left (right) of the given string before the n-grams are computed.

If no padding is specified (default), the value is split into n-grams as is. If the string contains no more than n characters, it is returned unchanged as the only token.
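The padding and splitting rules described above can be illustrated with a small standalone sketch. This is not the library's implementation: the function name `ngrams` is hypothetical, it returns plain strings rather than Token objects, and converting non-string scalars with `str()` is an assumption made for illustration only.

```python
from typing import List, Optional


def ngrams(value, n: int, pleft: Optional[str] = None,
           pright: Optional[str] = None) -> List[str]:
    """Hypothetical helper that mirrors the documented n-gram rules."""
    # Assumption: non-string scalars are converted to their string form.
    text = str(value)
    # Add n-1 padding characters on each side where one was given.
    if pleft:
        text = pleft * (n - 1) + text
    if pright:
        text = text + pright * (n - 1)
    # A string with no more than n characters is returned as a single token.
    if len(text) <= n:
        return [text]
    # Slide a window of width n over the (padded) string.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(ngrams("abcd", 2))                        # ['ab', 'bc', 'cd']
print(ngrams("ab", 3))                          # ['ab']
print(ngrams("abc", 3, pleft="$", pright="#"))  # ['$$a', '$ab', 'abc', 'bc#', 'c##']
```

With n = 3 and both padding characters set, the string "abc" becomes "$$abc##" before splitting, so the first and last n-grams each contain only one character of the original value.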

tokens(value: Union[int, float, str, datetime.datetime], rowidx: Optional[int] = None) → List[openclean.function.token.base.Token]

Convert a given scalar value into a list of n-grams. If the value length is not greater than n and no padding was specified, the returned list will contain only the given value.

Parameters
  • value (scalar) – Value that is converted into a list of n-grams.

  • rowidx (int, default=None) – Optional index of the dataset row that the value originates from.

Return type

list of openclean.function.token.base.Token