Rank by Boolean Expr BM25

Description

Ranks the objects in SOURCE [OBJ,STRING] by the relevance score of each STRING with respect to the boolean expression given in the Query parameter. Relevance is computed with the Okapi BM25 ranking function.

Inputs

SOURCE [OBJ,STRING]: a 2-column input of object-string pairs, typically obtained with the Extract string block.

Outputs

RESULT [OBJ]: a list of ranked objects

Parameters

Query: a boolean query
- Use and, or (case does not matter) to express conjunctions and disjunctions of terms
- Use parentheses to group sub-expressions
- Negations are not yet supported
- Quotes to group terms into a phrase are not yet supported
- Example: apple AND (pear OR banana)
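
The query syntax above can be illustrated with a small recursive-descent parser. This is only a sketch, not the block's actual implementation: the names parse_query and matches are hypothetical, and the precedence (AND binding tighter than OR when no parentheses are given) is an assumption that this page does not state.

```python
import re


def parse_query(query):
    """Parse a boolean query into a nested tuple tree.

    Assumed grammar (not taken from the block's source):
        expr   := term ("OR" term)*
        term   := factor ("AND" factor)*
        factor := WORD | "(" expr ")"
    """
    # Parentheses are their own tokens; everything else splits on whitespace.
    tokens = re.findall(r"\(|\)|[^\s()]+", query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def is_op(tok, name):
        # Operators are matched case-insensitively ("and", "AND", "And", ...).
        return tok is not None and tok.lower() == name

    def factor():
        nonlocal pos
        tok = peek()
        if tok is None:
            raise ValueError("unexpected end of query")
        if tok == "(":
            pos += 1
            node = expr()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return node
        if tok == ")" or tok.lower() in ("and", "or"):
            raise ValueError(f"unexpected token: {tok!r}")
        pos += 1
        return ("TERM", tok)

    def term():
        nonlocal pos
        node = factor()
        while is_op(peek(), "and"):
            pos += 1
            node = ("AND", node, factor())
        return node

    def expr():
        nonlocal pos
        node = term()
        while is_op(peek(), "or"):
            pos += 1
            node = ("OR", node, term())
        return node

    tree = expr()
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos]!r}")
    return tree


def matches(tree, doc_tokens):
    """Return True if the set of document tokens satisfies the query tree."""
    kind = tree[0]
    if kind == "TERM":
        return tree[1] in doc_tokens
    if kind == "AND":
        return matches(tree[1], doc_tokens) and matches(tree[2], doc_tokens)
    return matches(tree[1], doc_tokens) or matches(tree[2], doc_tokens)


tree = parse_query("apple AND (pear OR banana)")
print(tree)
print(matches(tree, {"apple", "banana"}))
print(matches(tree, {"pear", "banana"}))
```

In this sketch the terms are compared as-is; in practice the query terms would go through the same stemming, case and diacritics processing as the input strings.
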
Stemming: tokens can be stemmed for a specific language or left as they are
Case-sensitive: if set to false, upper/lower case is ignored
Normalize diacritics: transliterates non-ASCII characters into their closest ASCII form
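
As a rough illustration of the case and diacritics options, the sketch below uses Python's standard unicodedata module. It is an approximation under stated assumptions: the function name normalize and its arguments are hypothetical, and the block may use a fuller transliteration table (characters without a Unicode decomposition, such as "ø", are left unchanged here).

```python
import unicodedata


def normalize(text, case_sensitive=False, normalize_diacritics=True):
    """Approximate the Case-sensitive and Normalize diacritics options."""
    if normalize_diacritics:
        # Decompose accented characters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if not case_sensitive:
        # casefold() is a more aggressive, Unicode-aware lower().
        text = text.casefold()
    return text


print(normalize("Crème Brûlée"))
print(normalize("Café", case_sensitive=True))
```
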
Tokenization: the method to tokenize the input strings.
- None: perform no tokenization
- Spaces: split on all valid Unicode space characters
- Spaces/Punctuation: as Spaces, plus all valid Unicode punctuation characters
- Spaces/Punctuation/Digits: as Spaces/Punctuation, plus all valid Unicode digit characters
- Spaces/Punctuation/Digits/Symbols: as Spaces/Punctuation/Digits, plus all valid Unicode symbol characters
- Custom Regular Expression: any regular expression
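
The tokenization modes can be approximated with Unicode general categories, as in the sketch below. Several things here are assumptions rather than documented behaviour: the function name tokenize, the mode strings, the mapping of each mode to category prefixes (Z, P, Nd, S), the treatment of the custom regular expression as a separator pattern, and the "None" mode returning the whole string as a single token.

```python
import re
import unicodedata

# Unicode general-category prefixes treated as separators in each mode
# (an assumed mapping; the block's exact character classes may differ).
MODES = {
    "spaces": ("Z",),
    "spaces/punctuation": ("Z", "P"),
    "spaces/punctuation/digits": ("Z", "P", "Nd"),
    "spaces/punctuation/digits/symbols": ("Z", "P", "Nd", "S"),
}


def tokenize(text, mode="spaces/punctuation", pattern=None, min_token_length=1):
    """Split text into tokens and drop those shorter than min_token_length."""
    if mode == "none":
        tokens = [text]  # no tokenization: the whole string is one token
    elif mode == "custom":
        tokens = re.split(pattern, text)  # pattern assumed to match separators
    else:
        prefixes = MODES[mode]
        tokens, current = [], []
        for ch in text:
            # isspace() also covers tabs and newlines, which are category Cc.
            if ch.isspace() or unicodedata.category(ch).startswith(prefixes):
                tokens.append("".join(current))
                current = []
            else:
                current.append(ch)
        tokens.append("".join(current))
    # Empty strings (and None from unmatched regex groups) are always dropped.
    return [t for t in tokens if t and len(t) >= min_token_length]


sample = "Hello, world! It's 2024."
for mode in MODES:
    print(mode, tokenize(sample, mode))
print(tokenize(sample, "spaces/punctuation", min_token_length=2))
```
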
Min token length: tokens whose character length is shorter than this value are discarded
All query terms must match: if set to true, only candidates whose string in SOURCE matches every term of the query are considered a match
k1: controls the non-linear normalisation (saturation) of term frequency. Lower values saturate sooner, so repeated occurrences of a term quickly stop adding to the score
b: degree of document-length normalisation applied, from 0 (no normalisation) to 1 (full normalisation)
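
To make the roles of k1 and b concrete, here is a minimal sketch of a common BM25 formulation. It is not necessarily the exact variant the block implements: the function name bm25_scores is hypothetical, the IDF uses the non-negative form log(1 + (N - n + 0.5) / (n + 0.5)), the default values k1=1.2 and b=0.75 are widely used conventions rather than values stated on this page, and the query is assumed to have been flattened to a plain list of terms.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document (a list of tokens) against the query terms."""
    if not docs:
        return []
    n_docs = len(docs)
    # Average document length; fall back to 1.0 if every document is empty.
    avg_len = sum(len(d) for d in docs) / n_docs or 1.0
    # Number of documents containing each term.
    doc_freq = Counter()
    for d in docs:
        doc_freq.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        # b=0 gives 1 (length ignored); b=1 scales fully with relative length.
        length_norm = 1 - b + b * len(d) / avg_len
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # The fraction grows with tf but is bounded by k1 + 1 (saturation).
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    ["apple", "pear"],
    ["apple", "pear", "kiwi", "plum"],
    ["cherry"],
]
scores = bm25_scores(["apple"], docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking, scores)
```

With b=0 the two documents containing "apple" once each receive the same score regardless of length, while with b=1 the shorter one ranks higher. With k1=0 the term-frequency fraction is always 1, so only the presence of a term matters.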