# Rank by Text LM

### Description
Ranks the objects in `SOURCE [OBJ,STRING]` by the relevance score of each `STRING` against the keywords in `QTERMS [STRING]`.
The relevance is computed using the [Language modelling](http://www.cs.cmu.edu/~czhai/paper/sigir2001-smooth.pdf) ranking method.
Smoothing variants implemented: Jelinek-Mercer, Dirichlet, Dirichlet parameter-free.

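- Jelinek-Mercer: param `λ`, linear interpolation between the foreground document model and the background collection model
- Dirichlet: param `μ`, equivalent to Jelinek-Mercer with `λ = μ / (μ + |D|)`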
- Dirichlet (param-free): `μ = avg(|D|)`
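
The relationship between the three variants can be illustrated with a minimal sketch. This is not the block's actual implementation: the function names (`dirichlet_lambda`, `log_likelihood`, `rank`), the lowercase whitespace tokenization, and the use of log probabilities are all assumptions made for illustration. Here `lam` is the weight of the background collection model, matching `0` = only foreground, `1` = only background.

```python
import math
from collections import Counter


def dirichlet_lambda(mu, doc_len):
    """Background weight that makes Jelinek-Mercer equal to Dirichlet."""
    total = mu + doc_len
    return mu / total if total else 0.0


def log_likelihood(query_terms, doc_tokens, coll_tf, coll_len, lam):
    """Log query likelihood; lam is the weight of the background model."""
    tf = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for term in query_terms:
        p_doc = tf[term] / doc_len if doc_len else 0.0
        p_coll = coll_tf[term] / coll_len if coll_len else 0.0
        p = (1.0 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            return float("-inf")  # term has zero probability under the model
        score += math.log(p)
    return score


def rank(source, qterms, smoothing="dirichlet", lam=0.5, mu=2000.0):
    """Return objects sorted by decreasing score.

    source: list of (obj, string) pairs; qterms: list of keywords.
    Assumption: plain lowercase whitespace tokenization.
    """
    docs = [(obj, text.lower().split()) for obj, text in source]
    coll_tf = Counter(tok for _, toks in docs for tok in toks)
    coll_len = sum(coll_tf.values())
    avg_len = coll_len / len(docs) if docs else 0.0
    query = [q.lower() for q in qterms]

    scored = []
    for obj, toks in docs:
        if smoothing == "jelinek-mercer":
            weight = lam
        elif smoothing == "dirichlet":
            weight = dirichlet_lambda(mu, len(toks))
        else:  # "dirichlet-param-free": mu = avg(|D|)
            weight = dirichlet_lambda(avg_len, len(toks))
        score = log_likelihood(query, toks, coll_tf, coll_len, weight)
        scored.append((score, obj))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [obj for _, obj in scored]


# Hypothetical toy input: three object-string pairs and two keywords.
source = [
    ("d1", "the quick brown fox"),
    ("d2", "the lazy dog sleeps all day"),
    ("d3", "quick quick fox"),
]
print(rank(source, ["quick", "fox"], smoothing="dirichlet-param-free"))
```

Because Dirichlet smoothing derives the weight from the document length `|D|`, longer documents rely less on the background model than shorter ones, whereas Jelinek-Mercer applies the same `λ` to every document.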

### Inputs
- `SOURCE [OBJ,STRING]`: a 2-column input of object-string pairs, typically obtained with the `Extract string` block
- `QTERMS [STRING]`: a list of keywords to rank `SOURCE` objects against

### Outputs
- `RETRIEVE [OBJ]`: a list of ranked objects

### Parameters
- `Stemming`: tokens can be stemmed for a specific language or left as they are
- `Case-sensitive`: if set to `false`, upper/lower case is ignored
- `Normalize diacritics`: transliterates non-ASCII characters into their closest ASCII form
- `Tokenization`: the method to tokenize the input strings.
  - `None`: perform no tokenization
  - `Spaces`: split on all valid Unicode space characters
  - `Spaces/Punctuation`: `Spaces` + all valid Unicode punctuation characters
  - `Spaces/Punctuation/Digits`: `Spaces/Punctuation` + all valid Unicode digit characters
  - `Spaces/Punctuation/Digits/Symbols`: `Spaces/Punctuation/Digits` + all valid Unicode symbol characters
  - `Custom Regular Expression`: any [regular expression](https://www.regular-expressions.info)
- `Min token length`: tokens whose character length is shorter than this value are discarded
- `Gram type`:
  - `Word` (default): each token is composed of UTF-8 word n-grams
  - `Character`: each token is composed of UTF-8 character n-grams
- `Grams`: the size `n` of the n-gram tokens to extract (default is 1)
- `Smoothing`: smoothing method
  - `Jelinek-Mercer`: linear interpolation between foreground document model and background collection model
    - `λ`: `0` = only foreground, `1` = only background
  - `Dirichlet`: equivalent to `Jelinek-Mercer` where `λ = μ / (μ + |D|)`
    - `μ`: collection- and query-specific parameter. `0` = only foreground, `2000` = generic default.
  - `Dirichlet (param-free)`: `Dirichlet` with `μ = avg(|D|)`
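
How the text-processing parameters interact can be sketched as follows. This is an illustrative approximation, not the block's code: the function names, the numeric `level` encoding of the `Tokenization` options, the mapping of each option to Unicode general categories (`P*` for punctuation, `Nd` for digits, `S*` for symbols), and the NFKD-based diacritics removal are all assumptions. Stemming and custom regular expressions are left out.

```python
import unicodedata


def strip_diacritics(text):
    """Approximate 'Normalize diacritics': decompose, drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def is_separator(ch, level):
    """level: 0 = None, 1 = Spaces, 2 = +Punctuation, 3 = +Digits, 4 = +Symbols."""
    cat = unicodedata.category(ch)
    checks = [ch.isspace(), cat.startswith("P"), cat == "Nd", cat.startswith("S")]
    return any(checks[:level])


def tokenize(text, level=2, min_len=1, case_sensitive=False, normalize=True):
    """Split text on separator characters and drop tokens shorter than min_len."""
    if not case_sensitive:
        text = text.lower()
    if normalize:
        text = strip_diacritics(text)
    tokens, current = [], []
    for ch in text:
        if is_separator(ch, level):
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return [t for t in tokens if len(t) >= min_len]


def word_ngrams(tokens, n=1):
    """Sliding window of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(tokens, n=1):
    """Sliding window of n characters inside each token."""
    return [t[i:i + n] for t in tokens for i in range(len(t) - n + 1)]


tokens = tokenize("Café-au-lait, 2 cups!", level=2)
print(tokens)
print(word_ngrams(tokens, n=2))
print(char_ngrams(["fox"], n=2))
```

In this sketch each tokenization level is a superset of the previous one, so raising the level can only split the input into more, shorter tokens.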

Output scores can be [normalised](docs://score_normalisation).