# Rank by Text TF-IDF

### Description

Ranks objects in `SOURCE [OBJ,STRING]` according to the relevance score of each `STRING` with respect to the keywords in `QTERMS [STRING]`. The relevance score is computed using a generic [Vector Space Model](https://en.wikipedia.org/wiki/Vector_space_model) framework, which can be customised to implement several weighting schemes. The default weighting scheme is [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

### Inputs

- `SOURCE [OBJ,STRING]`: a 2-column input with an object-string pair, typically obtained with the `Extract string` block
- `QTERMS [STRING]`: a list of keywords to rank `SOURCE` objects against

### Outputs

- `RETRIEVE [OBJ]`: a list of ranked objects

### Parameters

Note that not all combinations of parameters are expected to work well. Also, some methods inherently perform score normalisation, while others do not.

- `Stemming`: tokens can be stemmed for a specific language or left as they are
- `Case-sensitive`: if set to `false`, upper/lower case is ignored
- `Normalize diacritics`: transliterates non-ASCII characters into their closest ASCII form
- `Tokenization`: the method used to tokenize the input strings.
  - `None`: perform no tokenization
  - `Spaces`: all valid Unicode space characters
  - `Spaces/Punctuation`: `Spaces` + all valid Unicode punctuation characters
  - `Spaces/Punctuation/Digits`: `Spaces/Punctuation` + all valid Unicode digit characters
  - `Spaces/Punctuation/Digits/Symbols`: `Spaces/Punctuation/Digits` + all valid Unicode symbol characters
  - `Custom Regular Expression`: any [regular expression](https://www.regular-expressions.info)
- `Min token length`: tokens whose character length is shorter than this value are discarded
- `Gram type`:
  - `Word` (default): each token is composed of UTF-8 word n-grams
  - `Character`: each token is composed of UTF-8 character n-grams
- `Grams`: the size of the n-gram tokens to extract (default is 1)
- `All query terms must match`: if set to `true`, only candidates where all tokens in `QTERMS` match a string in `SOURCE` are considered a match
- `Document TF`: term frequency weight for documents in `SOURCE`
  - `BNRY`: [binary](https://en.wikipedia.org/wiki/Tf%E2%80%93idf), only encodes term occurrence, ignoring the number of occurrences
  - `FREQ`: [frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf), encodes term frequency (number of occurrences)
  - `LOGA`: [logarithmic](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) (aka [log normalisation](https://en.wikipedia.org/wiki/Tf%E2%80%93idf))
  - `LOGN`: normalised logarithmic (aka [average log normalisation](https://en.wikipedia.org/wiki/Tf%E2%80%93idf))
  - `ANTF05`: augmented normalised (aka [double normalisation 0.5](https://en.wikipedia.org/wiki/Tf%E2%80%93idf))
  - `BM25`: [Okapi BM-25](https://en.wikipedia.org/wiki/Okapi_BM25) term frequency
    - `k1`: controls non-linear term frequency normalisation (saturation); a lower value means quicker saturation, i.e. additional occurrences of a term lose importance faster
    - `b`: degree of document-length normalisation applied.
      `0` = no normalisation, `1` = full normalisation
- `Document IDF`: inverse document frequency weight for documents in `SOURCE`
  - `NONE`: [unary](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) (constant `1`)
  - `IDFB`: [inverse document frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
  - `IDFP`: [smoothed probabilistic inverse document frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
  - `BM25`: [Okapi BM-25](https://en.wikipedia.org/wiki/Okapi_BM25) inverse document frequency
- `Document normalisation`:
  - `NONE`: no normalisation
  - `DL`: document-length normalisation (longer documents get a smaller prior)
  - `PUQN`: [pivoted unique document length normalisation](http://www.academia.edu/4088434/Pivoted_document_length_normalization)
    - `Slope`: tunable parameter for `PUQN`
- `Query TF`: term frequency weight for the terms in `QTERMS` (same options as `Document TF`)
- `Query IDF`: inverse document frequency weight for the terms in `QTERMS` (same options as `Document IDF`)

Output scores can be [normalised](docs://score_normalisation).
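To give a feel for how these parameters combine, here is a minimal sketch of the default tf-idf scheme. It is not the block's actual implementation: it assumes `Spaces/Punctuation`-style tokenization, the `LOGA` document TF, and the `IDFB` document IDF, and the helper names `tokenize` and `rank` are hypothetical.

```python
import math
import re

def tokenize(text, min_len=1):
    # Lowercase (case-insensitive) and split on whitespace/punctuation,
    # loosely mirroring the `Spaces/Punctuation` tokenization option.
    tokens = re.split(r"[\s\W]+", text.lower())
    # Drop tokens shorter than `Min token length`.
    return [t for t in tokens if len(t) >= min_len]

def rank(source, qterms):
    # source: list of (obj, string) pairs, as in SOURCE [OBJ,STRING].
    # qterms: list of keywords, as in QTERMS [STRING].
    docs = [(obj, tokenize(s)) for obj, s in source]
    n = len(docs)

    def idf(term):
        # IDFB-style inverse document frequency: log(N / df).
        df = sum(1 for _, toks in docs if term in toks)
        return math.log(n / df) if df else 0.0

    query = [t for q in qterms for t in tokenize(q)]
    scored = []
    for obj, toks in docs:
        score = 0.0
        for term in query:
            tf = toks.count(term)
            if tf:
                # LOGA document TF: log normalisation, 1 + log(tf).
                score += (1 + math.log(tf)) * idf(term)
        scored.append((obj, score))
    # Return objects by descending relevance, dropping non-matches.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [obj for obj, score in scored if score > 0]
```

For example, ranking `[("A", "red apple pie"), ("B", "green pear"), ("C", "apple apple")]` against `["apple"]` returns `["C", "A"]`: `C` mentions the term twice, `B` not at all.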
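The `k1` and `b` parameters of the `BM25` option follow the standard Okapi BM25 term-frequency component. The helper below is an illustrative sketch of that formula, not the block's code:

```python
def bm25_tf(tf, doc_len, avg_len, k1=1.2, b=0.75):
    # Standard Okapi BM25 term-frequency component.
    # k1 controls saturation: the score grows with tf but is bounded
    # by k1 + 1, so repeated occurrences matter less and less.
    # b controls document-length normalisation: 0 = none, 1 = full.
    norm = k1 * ((1 - b) + b * doc_len / avg_len)
    return tf * (k1 + 1) / (tf + norm)
```

With the defaults, a document of average length scores `1.0` for a single occurrence, and no number of occurrences can push the component past `k1 + 1`; raising `b` increasingly penalises documents longer than the average.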