# Tokenize [Obj,String]

### Description

Separates a string into tokens. The tokenization method is selected with the `Tokenization` parameter (e.g. split only on spaces, or on punctuation as well).

### Input

- `SOURCE [OBJ,STRING]`: a list of object-string pairs. Each string is tokenized.

### Output

- `PAIR [OBJ,STRING]`: each result pair contains an object from the input source and one token from its tokenized string; thus every token of a string is returned as a separate result pair.
- `RESULT [STRING]`: the extracted tokens. Use the score aggregation parameter to define how multiple occurrences of the same token are handled. Note that the reference to the object each token came from is lost.

### Parameters

- `Tokenization`: the method used to tokenize the input strings (see the sketch after this section):
  - `None`: perform no tokenization
  - `Spaces`: split on all valid Unicode space characters
  - `Spaces/Punctuation`: `Spaces` plus all valid Unicode punctuation characters
  - `Spaces/Punctuation/Digits`: `Spaces/Punctuation` plus all valid Unicode digit characters
  - `Spaces/Punctuation/Digits/Symbols`: `Spaces/Punctuation/Digits` plus all valid Unicode symbol characters
  - `Custom Regular Expression`: any [regular expression](https://www.regular-expressions.info)
- `Min token length`: tokens shorter than this many characters are discarded
- `Gram type` (see the n-gram sketch below):
  - `Word` (default): each token is composed of UTF-8 word n-grams
  - `Character`: each token is composed of UTF-8 character n-grams
- `Grams`: the n-gram size to extract (default is 1)
- `Stemming`: tokens can be stemmed for a specific language or left as they are
- `Case-sensitive`: if set to `false`, upper/lower case is ignored

Output scores can be [aggregated](docs://score_aggregation) and/or [normalised](docs://score_normalisation).
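The separator classes above map naturally onto Unicode general categories. The following Python sketch is illustrative only, not the operator's actual implementation: the `SEPARATOR_SETS` mapping, the `tokenize` function, and the exact Unicode categories chosen (P* for punctuation, Nd for digits, S* for symbols) are assumptions.

```python
import unicodedata
from typing import Iterator, Tuple

# Hypothetical mapping from the Tokenization parameter values to Unicode
# general-category prefixes; which categories the real operator uses is assumed.
SEPARATOR_SETS = {
    "Spaces": (),                                            # whitespace only
    "Spaces/Punctuation": ("P",),                            # + punctuation (P*)
    "Spaces/Punctuation/Digits": ("P", "Nd"),                # + decimal digits (Nd)
    "Spaces/Punctuation/Digits/Symbols": ("P", "Nd", "S"),   # + symbols (S*)
}

def _is_separator(ch: str, classes: Tuple[str, ...]) -> bool:
    """A character separates tokens if it is whitespace or its Unicode
    general category starts with one of the selected class prefixes."""
    if ch.isspace():
        return True
    category = unicodedata.category(ch)   # e.g. 'Lu', 'Po', 'Nd', 'Sm'
    return any(category.startswith(prefix) for prefix in classes)

def tokenize(text: str,
             mode: str = "Spaces/Punctuation",
             min_token_length: int = 1,
             case_sensitive: bool = False) -> Iterator[str]:
    """Split `text` on the separators selected by `mode`, then apply the
    `Min token length` and `Case-sensitive` parameters."""
    classes = SEPARATOR_SETS[mode]
    buffer = []
    for ch in text + " ":                 # trailing space flushes the last token
        if _is_separator(ch, classes):
            if buffer and len(buffer) >= min_token_length:
                token = "".join(buffer)
                yield token if case_sensitive else token.lower()
            buffer = []
        else:
            buffer.append(ch)
```

For example, with digits treated as separators the numeric run disappears and, since `case_sensitive` defaults to `false`, the remaining tokens are lower-cased:

```python
>>> list(tokenize("Price: 42 EUR (approx.)", mode="Spaces/Punctuation/Digits"))
['price', 'eur', 'approx']
```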
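With `Gram type` and `Grams` set, tokens become n-grams rather than single units. Below is a minimal sketch of both gram types, plus one way to approximate the `Stemming` parameter using NLTK's Snowball stemmer; joining word grams with a space and the choice of stemmer are assumptions, and the function names are illustrative:

```python
from typing import List

def word_ngrams(tokens: List[str], n: int = 2) -> List[str]:
    """'Word' gram type: slide a window of n consecutive tokens.
    Joining grams with a space is an assumption for display purposes."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(token: str, n: int = 3) -> List[str]:
    """'Character' gram type: slide a window of n characters inside a token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]

print(word_ngrams(["new", "york", "city"], n=2))   # ['new york', 'york city']
print(char_ngrams("york", n=3))                    # ['yor', 'ork']

# One way to approximate the `Stemming` parameter; whether the operator
# actually uses a Snowball stemmer is an assumption.
# Requires: pip install nltk
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer("english")
print(stemmer.stem("running"))                     # 'run'
```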