Tokenize [String]
Description
Separates each input string into tokens. The tokenization method is set with the Tokenization parameter (e.g. split on spaces only, or on spaces and all punctuation).
Input
SOURCE [STRING]
: a list of strings. Each string is tokenized.
Output
RESULT [STRING]
: the extracted tokens. Use the score aggregation parameter to define how occurrences of the same token are handled.
PAIR [STRING, STRING]
: the original string and the extracted tokens. Use the score aggregation parameter to define how occurrences of the same token are handled.
Parameters
Tokenization
: the method to tokenize the input strings.
    None
    : perform no tokenization
    Spaces
    : all valid Unicode space characters
    Spaces/Punctuation
    : Spaces + all valid Unicode punctuation characters
    Spaces/Punctuation/Digits
    : Spaces/Punctuation + all valid Unicode digit characters
    Spaces/Punctuation/Digits/Symbols
    : Spaces/Punctuation/Digits + all valid Unicode symbol characters
    Custom Regular Expression
    : any regular expression
Min token length
: tokens whose length in characters is shorter than this value are discarded
Gram type
:   Word
    (default): each token is composed of UTF-8 word n-grams
    Character
    : each token is composed of UTF-8 character n-grams
Grams
: the number of grams per token, which allows n-gram tokens to be extracted (default is 1)
Stemming
: tokens can be stemmed for a specific language or left as they are
Case-sensitive
: if set to false, upper/lower case is ignored
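The parameters above can be illustrated with a minimal Python sketch. It is not the operator's implementation: the mode names, function names and the order of the steps are hypothetical. It assumes that "space characters" means anything `str.isspace()` accepts, that punctuation, digits and symbols correspond to the Unicode general categories `P*`, `Nd` and `S*`, and that word n-grams are joined with a single space. The Custom Regular Expression option is left out because the description does not say whether the pattern matches tokens or separators.

```python
import unicodedata

# Hypothetical mode names mirroring the Tokenization options.
# Each maps to the Unicode general-category prefixes that act as
# separators in addition to whitespace (assumption: P* = punctuation,
# Nd = decimal digits, S* = symbols).
EXTRA_SEPARATORS = {
    "spaces": (),
    "spaces/punctuation": ("P",),
    "spaces/punctuation/digits": ("P", "Nd"),
    "spaces/punctuation/digits/symbols": ("P", "Nd", "S"),
}


def split_tokens(text, mode="spaces"):
    """Split text on the separator characters selected by mode."""
    if mode == "none":
        return [text]  # no tokenization: the whole string is one token
    prefixes = EXTRA_SEPARATORS[mode]
    tokens, current = [], []
    for ch in text:
        if ch.isspace() or unicodedata.category(ch).startswith(prefixes):
            if current:  # a separator closes the token being built
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def postprocess(tokens, min_length=1, case_sensitive=True):
    """Ignore case if requested, then drop tokens that are too short."""
    if not case_sensitive:
        tokens = [t.casefold() for t in tokens]
    return [t for t in tokens if len(t) >= min_length]


def word_ngrams(tokens, n=1):
    """Word n-grams: n consecutive tokens joined by a space (assumption)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(token, n=1):
    """Character n-grams: every window of n consecutive characters."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]
```

For example, `split_tokens("Hello, world! 42$", "spaces/punctuation")` yields `["Hello", "world", "42$"]`, while the `"spaces/punctuation/digits/symbols"` mode leaves only `["Hello", "world"]`. Stemming is not sketched here because it depends on a language-specific stemmer.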
Output scores can be aggregated and/or normalised.
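The page does not list the available aggregation or normalisation functions, so the following Python sketch shows only one plausible reading: aggregate by counting how often each token occurs, then normalise by dividing by the total. Both function names and both choices are assumptions made for illustration.

```python
from collections import Counter


def aggregate_counts(tokens):
    """Aggregate repeated tokens by summing their occurrences (assumed rule)."""
    return Counter(tokens)


def normalise(scores):
    """Scale scores so they sum to 1 (assumed normalisation)."""
    total = sum(scores.values())
    if total == 0:
        return {}
    return {token: score / total for token, score in scores.items()}


counts = aggregate_counts(["a", "a", "a", "b"])
shares = normalise(counts)
```

Here `counts` maps `"a"` to 3 and `"b"` to 1, and `shares` maps them to 0.75 and 0.25.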