Rank by Vector Similarity

Description

Ranks the objects in SOURCE [OBJ,STRING] by how similar the vector embedding of each STRING is to the vector embeddings of the QTERMS [STRING]. The embeddings are created with the selected embedding model. The similarity score is computed from the Euclidean distance between vectors, so a smaller distance means a closer match.
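As a rough illustration of distance-based ranking, the sketch below orders a few objects by Euclidean distance to a query vector. The vectors are hand-written stand-ins for real model embeddings, and `rank_by_distance` is a hypothetical helper, not part of the block. It assumes that an object's score is its smallest distance to any query vector; the block may combine multiple QTERMS differently.

```python
import math

def rank_by_distance(source, query_vectors):
    """Order objects from most to least similar to the query vectors.

    source: list of (object_id, vector) pairs.
    Assumption: an object's score is its smallest Euclidean distance to
    any query vector; the real block may aggregate queries differently.
    """
    scored = []
    for obj, vec in source:
        best = min(math.dist(vec, q) for q in query_vectors)
        scored.append((obj, best))
    # Smaller distance means more similar, so sort ascending
    return sorted(scored, key=lambda pair: pair[1])

# Toy 2-D vectors standing in for real embeddings
source = [("doc_a", [0.0, 0.0]), ("doc_b", [3.0, 4.0]), ("doc_c", [1.0, 0.0])]
queries = [[1.0, 0.0]]
ranked = rank_by_distance(source, queries)
```

Here `ranked` lists `doc_c` first because its vector coincides with the query, followed by `doc_a` and then `doc_b`.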

Input

SOURCE [OBJ,STRING]: a two-column input of object-string pairs, typically obtained with the Extract string block
QTERMS [STRING]: a list of keywords to rank the SOURCE objects against

Output

RETRIEVE [OBJ]: a list of ranked objects

Parameters

Embedding model: the embedding model used for creating the vector embeddings
- all-MiniLM-L6-v2
- snowflake-arctic-embed-l-v2.0
Pooling mode: how the embedded tokens of each input string are combined into one vector
- MEAN: the average value of each dimension across all tokens is taken; captures the overall meaning
- MAX: the highest value for each dimension across all tokens is taken; highlights the most prominent features
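The two pooling modes can be sketched in a few lines. This is a minimal illustration on hand-written token vectors. The `pool` function is a hypothetical name, and real embedding models typically also handle padding tokens, which this sketch ignores.

```python
def pool(token_vectors, mode="MEAN"):
    """Combine per-token vectors into one vector, dimension by dimension."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    columns = list(zip(*token_vectors))  # one tuple of values per dimension
    if mode == "MEAN":
        return [sum(col) / len(col) for col in columns]
    if mode == "MAX":
        return [max(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {mode}")

# Two tokens, three dimensions (illustrative values only)
tokens = [[1.0, 0.0, 2.0], [3.0, 4.0, -2.0]]
mean_vec = pool(tokens, "MEAN")  # [2.0, 2.0, 0.0]
max_vec = pool(tokens, "MAX")    # [3.0, 4.0, 2.0]
```

In the third dimension the two tokens cancel out under MEAN but the larger value survives under MAX, which is the practical difference between the modes.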
Chunk size: maximum number of characters that will be embedded as one vector. Strings longer than the chunk size will be split into multiple chunks.
Chunk overlap: number of characters that consecutive chunks share. Overlap helps prevent information from being split across chunk boundaries.
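How chunk size and chunk overlap interact can be shown with a small character-based splitter. This is a sketch under the assumption that chunks are cut at fixed character offsets; `chunk_text` is a hypothetical helper, and the block's actual splitting rules (for example, whether it respects word boundaries) are not documented here.

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into chunks of at most chunk_size characters,
    where consecutive chunks share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks

chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)
# ['abcd', 'cdef', 'efgh', 'ghij']
```

A string no longer than the chunk size comes back as a single chunk, so short inputs are embedded as one vector.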
Search type: the method used for vector similarity search
- EXACT: computes the exact distance between every source vector and every query vector; only recommended for a small number of source vectors (roughly 100,000 or fewer)
- HNSW: performs an approximate nearest-neighbour search based on the Hierarchical Navigable Small World algorithm, which avoids comparing every source vector with every query vector
K value: the number of objects to retrieve when using an approximate Search type; this strongly affects search time
Index name: the name under which the graph-based index used for approximate search is stored; it must be unique per set of source data

Output scores can be normalised.

Note: When using HNSW, the index is not updated automatically if the SOURCE vectors are changed or updated. Set a new Index name to build a fresh index that reflects the changes.