Match by BM25

Description

This is a multi-query BM25 block: it takes multiple lists of query keywords instead of a single one, which makes it equivalent to a matching operation. It finds matches between the STRING-columns of the two inputs by calculating the BM25 relevance score.

Input

Because this block was originally a retrieval block, the notation SOURCE / QTERMS is used instead of the A / B notation of other matching blocks.

SOURCE [OBJ,STRING]: a list of candidates; the STRING-column is used for comparison and the OBJ-column is returned in the result
QTERMS [OBJ,STRING]: a list of queries; the STRING-column holds the query keywords that are compared against SOURCE and the OBJ-column is returned in the result

Output

RESULT [OBJ,OBJ]: the matched objects from SOURCE and QTERMS
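
To make the input/output contract concrete, here is a minimal Python sketch of multi-query BM25 matching. It is not the block's actual implementation. The function name `bm25_match`, the lower-cased whitespace tokenization, the Lucene-style IDF variant, the defaults `k1=1.2` and `b=0.75`, and the rule "a pair matches when its score is above 0" are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_match(source, qterms, k1=1.2, b=0.75, all_terms_must_match=False):
    """Return (source_obj, query_obj) pairs whose BM25 score is above 0.

    source and qterms are lists of (obj, string) tuples, mirroring the
    [OBJ,STRING] inputs. Everything else here is an illustrative assumption.
    """
    # Assumed preprocessing: lower-case and split on whitespace.
    docs = [(obj, text.lower().split()) for obj, text in source]
    docs = [(obj, toks) for obj, toks in docs if toks]  # skip empty strings
    if not docs:
        return []
    avgdl = sum(len(toks) for _, toks in docs) / len(docs)

    # Document frequency: in how many SOURCE strings each token occurs.
    df = Counter()
    for _, toks in docs:
        df.update(set(toks))

    def idf(term):
        # One common IDF variant (always positive); the block may use another.
        n = df[term]
        return math.log(1 + (len(docs) - n + 0.5) / (n + 0.5))

    result = []
    for q_obj, q_text in qterms:
        q_tokens = set(q_text.lower().split())
        for d_obj, toks in docs:
            tf = Counter(toks)
            if all_terms_must_match and not all(t in tf for t in q_tokens):
                continue
            norm = k1 * (1 - b + b * len(toks) / avgdl)
            score = sum(
                idf(t) * tf[t] * (k1 + 1) / (tf[t] + norm)
                for t in q_tokens
                if t in tf
            )
            if score > 0:
                result.append((d_obj, q_obj))
    return result
```

Called with `source = [("s1", "red apple pie"), ("s2", "green apple"), ("s3", "blue car")]` and `qterms = [("q1", "apple pie"), ("q2", "car")]`, the sketch pairs `q1` with both `s1` and `s2`. With `all_terms_must_match=True` it pairs `q1` only with `s1`, because `s2` lacks the token `pie`.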

Parameters

Stemming: tokens can be stemmed for a specific language or left as they are
Case-sensitive: if set to false, upper/lower case is ignored
Normalize diacritics: transliterates non-ASCII characters into their closest ASCII form
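
The two normalisation switches above can be approximated as follows. This is a sketch under assumptions: the function name `normalize` and its arguments are hypothetical, and the diacritics step uses Unicode NFKD decomposition, which only handles characters that decompose into a base letter plus combining marks. Letters such as "ø" are left unchanged, so the block's real transliteration is likely more complete.

```python
import unicodedata


def normalize(text, case_sensitive=False, normalize_diacritics=True):
    """Approximate the Case-sensitive and Normalize diacritics options."""
    if normalize_diacritics:
        # Decompose characters, then drop the combining marks (accents).
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if not case_sensitive:
        # casefold() is a more aggressive lower() intended for comparisons.
        text = text.casefold()
    return text
```

For example, `normalize("Crème Brûlée")` yields `"creme brulee"`, so a query written without accents or capitals can still match.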
Tokenization: the method used to split the input strings into tokens
- None: perform no tokenization
- Spaces: split on all valid Unicode space characters
- Spaces/Punctuation: Spaces + all valid Unicode punctuation characters
- Spaces/Punctuation/Digits: Spaces/Punctuation + all valid Unicode digit characters
- Spaces/Punctuation/Digits/Symbols: Spaces/Punctuation/Digits + all valid Unicode symbol characters
- Custom Regular Expression: split using any regular expression
Min token length: tokens whose character length is shorter than this value are discarded
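
The tokenization modes and the minimum token length can be sketched in Python as shown here. The mode names, the function `tokenize`, and the mapping to Unicode general categories (`P` for punctuation, `Nd` for digits, `S` for symbols) are assumptions. The sketch also assumes that the custom regular expression describes separators rather than tokens, which this page does not state.

```python
import re
import unicodedata

# Hypothetical mode names; each maps to the Unicode category prefixes that
# count as separators in addition to whitespace.
SEPARATOR_CATEGORIES = {
    "spaces": (),
    "spaces_punct": ("P",),
    "spaces_punct_digits": ("P", "Nd"),
    "spaces_punct_digits_symbols": ("P", "Nd", "S"),
}


def tokenize(text, mode="spaces", pattern=None, min_token_length=1):
    """Split text according to the selected mode, then drop short tokens."""
    if mode == "none":
        tokens = [text]  # the whole string is a single token
    elif mode == "custom":
        # Assumption: the pattern matches separators.
        tokens = re.split(pattern, text)
    else:
        prefixes = SEPARATOR_CATEGORIES[mode]

        def is_separator(ch):
            return ch.isspace() or unicodedata.category(ch).startswith(prefixes)

        # Turn every separator into a plain space, then split on spaces.
        tokens = "".join(" " if is_separator(ch) else ch for ch in text).split()
    # Min token length: discard empty tokens and tokens shorter than the limit.
    return [t for t in tokens if t and len(t) >= min_token_length]
```

With the input `"Hello, world! 42€"`, each successive mode removes one more class of characters: `spaces` keeps `"Hello,"`, `spaces_punct` keeps `"Hello"`, `spaces_punct_digits` reduces `"42€"` to `"€"`, and `spaces_punct_digits_symbols` drops that token entirely.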
Gram type: the unit from which n-grams are built
- Word (default): each token is composed of UTF-8 word n-grams
- Character: each token is composed of UTF-8 character n-grams
Grams: the number n of units per n-gram token (default is 1)
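
The difference between the two gram types is easiest to see on a small example. The helper names `word_ngrams` and `char_ngrams` are hypothetical, and applying character n-grams per token (rather than to the whole string) is an assumption. Python strings are indexed by code point, so a character such as "é" counts as one unit here.

```python
def word_ngrams(tokens, n=1):
    """Join every run of n consecutive word tokens into one n-gram token."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(token, n=1):
    """Slide a window of n characters over a single token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]
```

For instance, `word_ngrams(["new", "york", "city"], 2)` gives `["new york", "york city"]`, while `char_ngrams("café", 3)` gives `["caf", "afé"]`. An input shorter than `n` produces no tokens in this sketch.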
All query terms must match: if set to true, only candidates where all tokens in QTERMS match a string in SOURCE are considered a match
k1: controls the non-linear term-frequency normalisation (saturation). Lower values saturate sooner, so repeated occurrences of a term add less to the score
b: the degree of document-length normalisation applied. 0 = no normalisation, 1 = full normalisation
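
The effect of k1 and b can be explored with the term-frequency component of the standard BM25 formula. This is a sketch: the function name `tf_weight` and the default values `k1=1.2` and `b=0.75` are common choices in the literature, not values confirmed by this page.

```python
def tf_weight(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Term-frequency part of BM25 for one term in one document.

    tf          -- how often the term occurs in the document
    doc_len     -- document length in tokens
    avg_doc_len -- average document length over SOURCE
    """
    # b blends between no length normalisation (0) and full normalisation (1).
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # k1 sets how quickly additional occurrences stop adding to the weight.
    return tf * (k1 + 1) / (tf + k1 * length_norm)
```

For a document of average length, going from 1 to 10 occurrences multiplies the weight by about 1.43 with `k1=0.5` but by 2.5 with `k1=2.0`, which illustrates the quicker saturation at lower k1. With `b=0` the document length has no influence at all, and with `b=1` a document twice the average length is penalised the most.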