Tokenize [String]

Description

Separates each input string into tokens. The Tokenization parameter defines the method, e.g. splitting on spaces only or on spaces and all punctuation.

Input

  • SOURCE [STRING]: a list of strings. Each string is tokenized.

Output

  • RESULT [STRING]: the extracted tokens. Use the score aggregation parameter to define how occurrences of the same token are handled.

  • PAIR [STRING, STRING]: the original string and the extracted tokens. Use the score aggregation parameter to define how occurrences of the same token are handled.

Parameters

  • Tokenization: the method to tokenize the input strings.

    • None: perform no tokenization

    • Spaces: split on any valid Unicode space character

    • Spaces/Punctuation: split on spaces and on any valid Unicode punctuation character

    • Spaces/Punctuation/Digits: split on spaces, punctuation and any valid Unicode digit character

    • Spaces/Punctuation/Digits/Symbols: split on spaces, punctuation, digits and any valid Unicode symbol character

    • Custom Regular Expression: any regular expression

  • Min token length: tokens with fewer characters than this value are discarded

  • Gram type:

    • Word (default): each token is composed of UTF-8 word n-grams

    • Character: each token is composed of UTF-8 character n-grams

  • Grams: the n-gram size, i.e. how many words or characters make up each token (default is 1)

  • Stemming: tokens can be stemmed for a specific language or left as they are

  • Case-sensitive: if set to false, upper/lower case is ignored
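
The delimiter-based Tokenization options, together with Min token length and Case-sensitive, can be illustrated with a minimal Python sketch. This is not the operator's implementation: the function and mode names are hypothetical, the mapping of each option to Unicode general categories (Z, P, Nd, S) is an assumption, and lowercasing is only one possible way to ignore case.

```python
import unicodedata

# Hypothetical mode names mirroring the Tokenization options above.
# Assumption: each option maps to these Unicode general-category prefixes:
# Z = separators (spaces), P = punctuation, Nd = decimal digits, S = symbols.
DELIMITER_CATEGORIES = {
    "spaces": ("Z",),
    "spaces_punctuation": ("Z", "P"),
    "spaces_punctuation_digits": ("Z", "P", "Nd"),
    "spaces_punctuation_digits_symbols": ("Z", "P", "Nd", "S"),
}


def is_delimiter(ch, mode):
    """Return True if ch separates tokens under the given mode."""
    # str.isspace() also covers tab and newline, which are control characters.
    if ch.isspace():
        return True
    return unicodedata.category(ch).startswith(DELIMITER_CATEGORIES[mode])


def tokenize(text, mode="spaces", min_length=1, case_sensitive=True):
    """Split text on delimiter characters and drop tokens that are too short."""
    if not case_sensitive:
        text = text.lower()  # assumption: ignoring case means lowercasing
    if mode == "none":
        tokens = [text]  # no tokenization: the whole string is one token
    else:
        tokens, current = [], []
        for ch in text:
            if is_delimiter(ch, mode):
                if current:
                    tokens.append("".join(current))
                    current = []
            else:
                current.append(ch)
        if current:
            tokens.append("".join(current))
    return [t for t in tokens if len(t) >= min_length]


sample = "Price: $42, e-mail me!"
print(tokenize(sample, "spaces"))
print(tokenize(sample, "spaces_punctuation", min_length=2))
print(tokenize(sample, "spaces_punctuation_digits_symbols", case_sensitive=False))
```

Each richer mode only adds delimiter classes, so the tokens get progressively shorter: with Spaces the sample keeps "$42," as one token, while with Spaces/Punctuation/Digits/Symbols nothing of it remains.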

Output scores can be aggregated and/or normalised.
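
The difference between the Word and Character gram types for a given Grams value can be sketched as follows. The helper names are hypothetical, and joining the words of a word n-gram with a single space is an assumption, since the format of the operator's n-gram tokens is not specified here.

```python
def word_ngrams(tokens, n=1):
    """Return every run of n consecutive word tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(token, n=1):
    """Return every run of n consecutive characters of a single token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


words = ["the", "quick", "brown", "fox"]
print(word_ngrams(words, 2))    # word bigrams
print(char_ngrams("token", 3))  # character trigrams
```

With Grams set to 1, both helpers return the input units unchanged (single words or single characters). An input shorter than n yields no n-grams in this sketch.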