Rank by Text TF-IDF¶

Description¶

Ranks objects in SOURCE [OBJ,STRING] according to the relevance of each STRING to the keywords in QTERMS [STRING]. The relevance score is computed with a generic Vector Space Model framework, which can be customised to implement several weighting schemes. The default weighting scheme is [tf-idf|https://en.wikipedia.org/wiki/Tf%E2%80%93idf].
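As a rough illustration of what ranking by tf-idf means, the following minimal sketch scores each string by the summed tf-idf weight of the matching query terms. It is not the block's implementation: the function name `rank_tfidf`, the whitespace tokenization, the lower-casing and the plain `tf * log(N / df)` weighting are all assumptions made for the example.

```python
import math
from collections import Counter

def rank_tfidf(source, qterms):
    """Rank (obj, string) pairs by the summed tf-idf weight of the query terms.

    Hypothetical helper: raw term frequency times log(N / df), no normalisation.
    """
    # Tokenize on whitespace and count term occurrences per document.
    docs = [(obj, Counter(text.lower().split())) for obj, text in source]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for _, counts in docs for term in counts)
    scored = []
    for obj, counts in docs:
        # Keep only documents that contain at least one query term.
        if not any(counts[t] for t in qterms):
            continue
        score = sum(counts[t] * math.log(n_docs / df[t]) for t in qterms if counts[t])
        scored.append((obj, score))
    # Highest score first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

source = [
    ("d1", "the quick brown fox"),
    ("d2", "the lazy dog"),
    ("d3", "quick quick fox jumps"),
]
ranking = rank_tfidf(source, ["quick", "fox"])
```

Here `d3` ranks above `d1` because it contains "quick" twice, and `d2` is dropped because it contains neither query term.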

Inputs¶

SOURCE [OBJ,STRING]: a 2-column input of object-string pairs, typically obtained with the Extract string block
QTERMS [STRING]: a list of keywords to rank SOURCE objects against

Outputs¶

RETRIEVE [OBJ]: a list of ranked objects

Parameters¶

Note that not all parameter combinations are expected to work well. In addition, some weighting methods inherently normalise the scores, while others do not.

Stemming: tokens can be stemmed for a specific language or left as they are
Case-sensitive: if set to false, upper/lower case is ignored
Normalize diacritics: transliterates non-ASCII characters into their closest ASCII form
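The effect of the case and diacritics options can be sketched with Python's standard `unicodedata` module. This is only an approximation under stated assumptions: the function name `normalise` is hypothetical, and stripping combining marks after NFKD decomposition handles accented letters but not characters without a decomposition (for example "ø"), which a full transliteration would also map to ASCII.

```python
import unicodedata

def normalise(text, case_sensitive=False, strip_diacritics=True):
    """Hypothetical helper: optional diacritics removal and case folding."""
    if strip_diacritics:
        # Decompose accented characters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if not case_sensitive:
        # casefold() is a more aggressive, Unicode-aware lower().
        text = text.casefold()
    return text

print(normalise("Crème Brûlée"))  # prints "creme brulee"
```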
Tokenization: the method to tokenize the input strings.
- None: perform no tokenization
- Spaces: split on all valid Unicode space characters
- Spaces/Punctuation: split on Spaces + all valid Unicode punctuation characters
- Spaces/Punctuation/Digits: split on Spaces/Punctuation + all valid Unicode digit characters
- Spaces/Punctuation/Digits/Symbols: split on Spaces/Punctuation/Digits + all valid Unicode symbol characters
- Custom Regular Expression: any regular expression
Min token length: tokens whose character length is shorter than this value are discarded
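The tokenization modes and the minimum token length can be illustrated as follows. This is a sketch under assumptions, not the block's code: the `tokenize` function, its mode names and the mapping of each mode to Unicode general categories (`P` for punctuation, `Nd` for decimal digits, `S` for symbols) are choices made for the example.

```python
import re
import unicodedata

# Unicode general-category prefixes treated as separators, on top of whitespace.
# The mapping of modes to categories is an assumption for this sketch.
MODES = {
    "spaces": (),
    "spaces/punctuation": ("P",),
    "spaces/punctuation/digits": ("P", "Nd"),
    "spaces/punctuation/digits/symbols": ("P", "Nd", "S"),
}

def tokenize(text, mode="spaces", pattern=None, min_length=1):
    """Hypothetical tokenizer mirroring the options listed above."""
    if mode == "none":
        tokens = [text]
    elif mode == "custom":
        # Split on a user-supplied regular expression.
        tokens = re.split(pattern, text)
    else:
        prefixes = MODES[mode]

        def is_separator(ch):
            if ch.isspace():
                return True
            return bool(prefixes) and unicodedata.category(ch).startswith(prefixes)

        # Replace every separator with a space, then split on whitespace.
        cleaned = "".join(" " if is_separator(ch) else ch for ch in text)
        tokens = cleaned.split()
    # Discard tokens shorter than the minimum length (this also drops empty strings).
    return [t for t in tokens if len(t) >= min_length]

text = "Hello, world! 42 + 1"
print(tokenize(text, "spaces/punctuation"))
print(tokenize(text, "spaces/punctuation", min_length=2))
```

Each richer mode only adds separators, so the token list can only get shorter or more fragmented as more character classes are treated as boundaries.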
Gram type:
- Word (default): each token is composed of UTF-8 word n-grams
- Character: each token is composed of UTF-8 character n-grams
Grams: the n-gram size, i.e. how many consecutive words or characters form one token (default is 1)
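The difference between the two gram types can be shown with two small helpers. The function names are hypothetical, and joining word n-grams with a single space is an assumption of this sketch.

```python
def word_ngrams(tokens, n=1):
    """Sliding window of n consecutive words, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(token, n=1):
    """Sliding window of n consecutive characters within one token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]

print(word_ngrams(["new", "york", "city"], 2))  # ['new york', 'york city']
print(char_ngrams("york", 3))                   # ['yor', 'ork']
```

With n equal to 1, word n-grams are simply the original tokens. Inputs shorter than n yield no n-grams in this sketch.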
All query terms must match: if set to true, only SOURCE strings that contain every token in QTERMS are considered a match
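The matching rule can be expressed as a set comparison. In this sketch the `matches` function is hypothetical, and the behaviour when the option is false (at least one query token must occur) is an assumption rather than something the description states.

```python
def matches(doc_tokens, query_tokens, require_all=False):
    """Hypothetical candidate filter for a single document."""
    doc_set = set(doc_tokens)
    query_set = set(query_tokens)
    if require_all:
        # Every query token must occur in the document.
        return query_set <= doc_set
    # Assumed default: at least one query token occurs in the document.
    return bool(query_set & doc_set)

doc = ["quick", "brown", "fox"]
print(matches(doc, ["quick", "dog"]))                    # True
print(matches(doc, ["quick", "dog"], require_all=True))  # False
```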
Document TF: term frequency weight for documents in SOURCE
- BNRY: binary, only encodes term occurrence, ignoring the number of occurrences
- FREQ: frequency, encodes term frequency (number of occurrences)
- LOGA: logarithmic (aka log normalisation)
- LOGN: normalised logarithmic (aka average log normalisation)
- ANTF05: augmented normalised (aka double normalisation 0.5)
- BM25: Okapi BM-25 term frequency
  - k1: controls non-linear term frequency normalisation (saturation). A lower value means quicker saturation, i.e. additional occurrences of a term stop increasing the weight sooner
  - b: degree of document-length normalisation applied. 0=no normalisation, 1=full normalisation
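The term frequency options correspond to well-known textbook formulas, sketched below. The formulas are the common definitions and are assumed here; the block's exact variants may differ. The function name `tf_weight`, its argument names and the defaults `k1=1.2` and `b=0.75` (commonly used values) are also assumptions.

```python
import math

def tf_weight(scheme, tf, avg_tf=1.0, max_tf=1, doc_len=1, avg_doc_len=1.0,
              k1=1.2, b=0.75):
    """Textbook term frequency weights (assumed forms, not the block's code).

    tf: occurrences of the term in the document
    avg_tf / max_tf: average / maximum term frequency within the document
    doc_len / avg_doc_len: document length and average length in the collection
    """
    if tf <= 0:
        return 0.0
    if scheme == "BNRY":
        return 1.0
    if scheme == "FREQ":
        return float(tf)
    if scheme == "LOGA":
        return 1.0 + math.log(tf)
    if scheme == "LOGN":
        return (1.0 + math.log(tf)) / (1.0 + math.log(avg_tf))
    if scheme == "ANTF05":
        return 0.5 + 0.5 * tf / max_tf
    if scheme == "BM25":
        # b interpolates between no (0) and full (1) length normalisation.
        length_norm = 1.0 - b + b * doc_len / avg_doc_len
        # The weight saturates towards k1 + 1 as tf grows.
        return tf * (k1 + 1.0) / (tf + k1 * length_norm)
    raise ValueError(f"unknown TF scheme: {scheme}")

for tf in (1, 3, 10, 100):
    print(tf, round(tf_weight("BM25", tf), 3))
```

Printing the BM25 weight for growing term frequencies shows the saturation effect: the value approaches, but never reaches, k1 + 1.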
Document IDF: inverse document frequency weight for documents in SOURCE
- NONE: unary (constant 1)
- IDFB: inverse document frequency
- IDFP: smoothed probabilistic inverse document frequency
- BM25: Okapi BM-25 inverse document frequency
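The inverse document frequency options can be sketched in the same way. The formulas below are assumptions based on common definitions: IDFP is shown as the classical probabilistic idf floored at zero (the block's smoothing may differ), and BM25 as the Robertson/Sparck Jones form with 0.5 offsets (some implementations add 1 inside the logarithm to keep it positive). The function name `idf_weight` is hypothetical.

```python
import math

def idf_weight(scheme, n_docs, df):
    """Textbook IDF weights (assumed forms, not the block's code).

    n_docs: number of documents in the collection
    df: number of documents containing the term (1 <= df <= n_docs)
    """
    if scheme == "NONE":
        return 1.0
    if scheme == "IDFB":
        return math.log(n_docs / df)
    if scheme == "IDFP":
        if df >= n_docs:
            # The term occurs everywhere: no discriminative power.
            return 0.0
        return max(0.0, math.log((n_docs - df) / df))
    if scheme == "BM25":
        return math.log((n_docs - df + 0.5) / (df + 0.5))
    raise ValueError(f"unknown IDF scheme: {scheme}")

for df in (1, 10, 50, 100):
    print(df, round(idf_weight("IDFB", 100, df), 3))
```

In every scheme except NONE, rarer terms receive a higher weight than terms that occur in most documents.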
Document normalisation:
- NONE: no normalisation
- DL: document-length normalisation (longer = smaller prior)
- PUQN: pivoted unique document length normalisation
  - Slope: tunable parameter for PUQN
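Pivoted unique normalisation is usually defined as a linear interpolation between a collection-wide pivot and the number of unique terms in the document. The sketch below assumes that standard form, with the pivot taken as the average number of unique terms per document. The function name and the default slope of 0.2 are assumptions, not values confirmed for this block.

```python
def pivoted_unique_norm(n_unique_terms, pivot, slope=0.2):
    """Normalisation factor by which a document's score is divided (assumed form)."""
    return (1.0 - slope) * pivot + slope * n_unique_terms

# Number of unique terms per document in a toy collection.
unique_counts = {"short": 5, "average": 10, "long": 15}
pivot = sum(unique_counts.values()) / len(unique_counts)  # 10.0

for name, n_unique in unique_counts.items():
    print(name, pivoted_unique_norm(n_unique, pivot))
```

A document of exactly pivot length is divided by the pivot itself. Longer documents are divided by a larger factor and shorter ones by a smaller factor, and the slope controls how strongly length is penalised.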
Query TF: term frequency weight for the query terms in QTERMS
- (same options as for Document TF)
Query IDF: inverse document frequency weight for the query terms in QTERMS
- (same options as for Document IDF)

Output scores can be normalised.