Rank by Text BM25

Description

Ranks the objects in SOURCE [OBJ,STRING] by the relevance of each STRING to the keywords in QTERMS [STRING]. Relevance is computed with the Okapi BM25 ranking function.
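
For reference, a commonly used form of the BM25 score of a string D for a query Q is shown here. The source does not state which IDF variant this block uses, so the IDF term below is an assumption (one widespread, always-positive variant). N is the number of SOURCE strings, n(t) the number of strings containing token t, f(t, D) the frequency of t in D, |D| the token count of D, and avgdl the average token count. k_1 and b are the parameters described under Parameters.

```latex
\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot
  \frac{f(t, D)\,(k_1 + 1)}
       {f(t, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(t) = \ln\!\left(1 + \frac{N - n(t) + 0.5}{n(t) + 0.5}\right)
```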

Inputs

  • SOURCE [OBJ,STRING]: a 2-column input of object-string pairs, typically obtained with the Extract string block

  • QTERMS [STRING]: a list of keywords to rank SOURCE objects against

Outputs

  • RETRIEVE [OBJ]: a list of ranked objects

Parameters

  • Stemming: tokens can be stemmed for a specific language or left as they are

  • Case-sensitive: if set to false, upper/lower case is ignored

  • Normalize diacritics: transliterates non-ASCII characters into their closest ASCII form

  • Tokenization: the method used to split the input strings into tokens

    • None: perform no tokenization

    • Spaces: split on any valid Unicode space character

    • Spaces/Punctuation: as Spaces, plus any valid Unicode punctuation character

    • Spaces/Punctuation/Digits: as Spaces/Punctuation, plus any valid Unicode digit character

    • Spaces/Punctuation/Digits/Symbols: as Spaces/Punctuation/Digits, plus any valid Unicode symbol character

    • Custom Regular Expression: any regular expression

  • Min token length: tokens whose character length is shorter than this value are discarded

  • Gram type:

    • Word (default): each token is a word n-gram (UTF-8)

    • Character: each token is a character n-gram (UTF-8)

  • Grams: the n-gram size used when extracting tokens (default is 1)

  • All query terms must match: if set to true, a string in SOURCE is considered a match only if all tokens of a QTERMS entry occur in it (AND logic for terms)

  • One query per QTERMS row: if set to true, each row in QTERMS is treated as a separate query. The contributions of all queries are summed (OR logic for queries)

  • k1: controls non-linear term frequency normalisation (saturation). A lower value means quicker saturation: repeated occurrences of a term stop adding to the score sooner

  • b: degree of document-length normalisation applied. 0 = no normalisation, 1 = full normalisation
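
To illustrate how the parameters above interact, here is a minimal Python sketch. It is not the block's actual implementation. The function names (tokenize, bm25_rank), the defaults k1=1.2 and b=0.75, the IDF variant, and the use of a `[\W_]+` regular expression to roughly approximate the Spaces/Punctuation option are all assumptions made for illustration. Stemming and diacritics normalisation are left out.

```python
import math
import re
from collections import Counter


def tokenize(text, min_len=1, case_sensitive=False, gram_type="word", n=1):
    """Split text into tokens, then build word or character n-grams."""
    if not case_sensitive:
        text = text.lower()
    # Assumption: [\W_]+ only approximates the "Spaces/Punctuation" option.
    words = [w for w in re.split(r"[\W_]+", text) if w and len(w) >= min_len]
    if gram_type == "word":
        return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    # Character n-grams, taken within each word.
    return [w[i:i + n] for w in words for i in range(len(w) - n + 1)]


def bm25_rank(source, queries, k1=1.2, b=0.75, match_all=False, **tok):
    """Rank (obj, string) pairs against a list of queries (one query per row).

    Per-query scores are summed (OR logic for queries). With match_all=True,
    a string must contain every token of a query (AND logic for terms).
    """
    docs = [(obj, Counter(tokenize(s, **tok))) for obj, s in source]
    n_docs = len(docs)
    avg_len = sum(sum(tf.values()) for _, tf in docs) / max(n_docs, 1)
    # Document frequency: number of strings containing each token.
    df = Counter(t for _, tf in docs for t in tf)

    scores = {}
    for query in queries:
        terms = set(tokenize(query, **tok))
        for obj, tf in docs:
            if match_all and not all(t in tf for t in terms):
                continue
            doc_len = sum(tf.values())
            score = 0.0
            for t in terms:
                if t not in tf:
                    continue
                # Assumed IDF variant; always positive.
                idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
                # b scales the length penalty, k1 the term-frequency saturation.
                denom = tf[t] + k1 * (1 - b + b * doc_len / avg_len)
                score += idf * tf[t] * (k1 + 1) / denom
            if score > 0:
                scores[obj] = scores.get(obj, 0.0) + score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


source = [
    ("doc1", "The quick brown fox"),
    ("doc2", "A lazy brown dog sleeps all day long"),
    ("doc3", "Stock markets fell sharply"),
]
ranking = bm25_rank(source, ["brown fox"])
strict = bm25_rank(source, ["brown fox"], match_all=True)
print([obj for obj, _ in ranking])
print([obj for obj, _ in strict])
```

In this example doc1 ranks above doc2 because it contains both query tokens and is shorter, and doc3 is absent because it matches nothing. With match_all=True only doc1 remains.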

Output scores can be normalised.