openclean.function.token.ngram module
String tokenizer that returns a list of n-grams. An n-gram in this case is a substring of length n.
- class openclean.function.token.ngram.NGrams(n: int, pleft: Optional[str] = None, pright: Optional[str] = None)
Bases: openclean.function.token.base.Tokenizer
Split values into lists of n-grams, where an n-gram is a substring of length n. Strings can optionally be padded with special characters on the left and right first: if a left (right) padding character is given (e.g. $), a string of n-1 padding characters is added to the left (right) of the given string before the n-grams are computed.
If no padding is specified (the default), the value is split into n-grams as is. If the string does not contain more than n characters, it is returned as is.
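The padding and splitting rules described above can be illustrated with a minimal standalone sketch. The function name `ngrams` is hypothetical and this is not the library's implementation; it only mirrors the documented behavior and returns plain strings rather than Token objects. Applying the length check after padding is an assumption.

```python
from typing import List, Optional


def ngrams(
    value: str,
    n: int,
    pleft: Optional[str] = None,
    pright: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper that mirrors the documented n-gram rules."""
    # Add n-1 padding characters on each side for which a padding
    # character was given.
    if pleft:
        value = pleft * (n - 1) + value
    if pright:
        value = value + pright * (n - 1)
    # Assumption: the length check is applied to the padded string.
    # A string with at most n characters is returned as a single item.
    if len(value) <= n:
        return [value]
    # Slide a window of width n over the string, one character at a time.
    return [value[i:i + n] for i in range(len(value) - n + 1)]


print(ngrams("abcd", 3))                       # ['abc', 'bcd']
print(ngrams("ab", 3))                         # ['ab']
print(ngrams("ab", 3, pleft="$", pright="#"))  # ['$$a', '$ab', 'ab#', 'b##']
```

With padding, every character of the original string appears in n different n-grams, which is why padded n-grams are often preferred when the start and end of a string should carry as much weight as its middle.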
- tokens(value: Union[int, float, str, datetime.datetime], rowidx: Optional[int] = None) → List[openclean.function.token.base.Token]
Convert a given scalar value into a list of n-grams. If the length of the value is not greater than n and no padding was specified, the returned list contains only the given value.
- Parameters
value (scalar) – Value that is converted into a list of n-grams.
rowidx (int, default=None) – Optional index of the dataset row that the value originates from.
- Return type
list of openclean.function.token.base.Token