Tokenize [String]

Description

Separates a string into tokens. The Tokenization parameter selects the method (e.g. split on spaces only, or on spaces and all punctuation).

RESULT [STRING]: the extracted tokens. Use the score aggregation parameter to define how occurrences of the same token are handled.
PAIR [STRING, STRING]: the original string and the extracted tokens. Use the score aggregation parameter to define how occurrences of the same token are handled.

Tokenization: the method used to tokenize the input strings. Each option names the characters treated as token separators.
- None: perform no tokenization
- Spaces: all valid Unicode space characters
- Spaces/Punctuation: Spaces + all valid Unicode punctuation characters
- Spaces/Punctuation/Digits: Spaces/Punctuation + all valid Unicode digit characters
- Spaces/Punctuation/Digits/Symbols: Spaces/Punctuation/Digits + all valid Unicode symbol characters
- Custom Regular Expression: any regular expression
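The separator options above can be illustrated with a minimal Python sketch. It is not the operator's implementation: the mode names, the mapping to Unicode general-category prefixes (Z, P, N, S), the choice to return the whole string for None, and the reading of the custom regular expression as a separator pattern are all assumptions made for illustration.

```python
import re
import unicodedata

# Hypothetical mode names mirroring the options above. The mapping to
# Unicode general-category prefixes is an assumption, not the tool's code.
SEPARATOR_CATEGORIES = {
    "none": (),
    "spaces": ("Z",),
    "spaces_punct": ("Z", "P"),
    "spaces_punct_digits": ("Z", "P", "N"),
    "spaces_punct_digits_symbols": ("Z", "P", "N", "S"),
}


def is_separator(ch, prefixes):
    """Return True if ch belongs to one of the separator categories."""
    # str.isspace() also covers tabs and newlines, which are not in category Z.
    if "Z" in prefixes and ch.isspace():
        return True
    return unicodedata.category(ch).startswith(prefixes)


def tokenize(text, mode="spaces"):
    """Split text on every character that the chosen mode treats as a separator."""
    prefixes = SEPARATOR_CATEGORIES[mode]
    if not prefixes:
        # Assumed behaviour for "None": the whole string is a single token.
        return [text] if text else []
    tokens, current = [], []
    for ch in text:
        if is_separator(ch, prefixes):
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def tokenize_regex(text, pattern):
    """Assumed reading of the custom option: the pattern matches separators."""
    return [t for t in re.split(pattern, text) if t]


sample = "Hello, world! 42 + x"
print(tokenize(sample, "spaces"))
print(tokenize(sample, "spaces_punct_digits_symbols"))
print(tokenize_regex("a;b|c", r"[;|]"))
```

Each mode only widens the set of separator characters, so a stricter mode never yields more characters per token than a looser one.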
Min token length: tokens whose character length is shorter than this value are discarded
Gram type:
- Word (default): each token is composed of UTF-8 word n-grams
- Character: each token is composed of UTF-8 character n-grams
Grams: the n-gram size used when extracting tokens (default is 1)
Stemming: tokens can be stemmed for a specific language or left as they are
Case-sensitive: if set to false, upper/lower case is ignored
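As a rough illustration of the gram type, gram size, minimum token length and case-sensitivity parameters, the Python sketch below operates on an already tokenized list. The function names, the use of a single space to join word n-grams, and the order in which the steps are applied are assumptions for illustration, not the operator's documented behaviour.

```python
def word_ngrams(tokens, n=1):
    """Join every run of n consecutive word tokens into one n-gram token."""
    # Joining with a single space is an assumption made for this sketch.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(token, n=1):
    """Slide a window of n characters over a single token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def post_process(tokens, min_length=1, case_sensitive=True):
    """Optionally lower-case tokens, then drop those shorter than min_length."""
    if not case_sensitive:
        tokens = [t.lower() for t in tokens]
    return [t for t in tokens if len(t) >= min_length]


words = ["new", "york", "city"]
print(word_ngrams(words, 2))
print(char_ngrams("token", 3))
print(post_process(["A", "to", "Tokens"], min_length=2, case_sensitive=False))
```

With a gram size of 1, both n-gram helpers return the individual words or characters unchanged, which matches the stated default.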

Output scores can be aggregated and/or normalised.