Rank by Vector Similarity¶
Description¶
Ranks objects in SOURCE [OBJ,STRING] according to the similarity scores of the vector embeddings of each STRING with those of QTERMS [STRING].
The vector embeddings are created using the selected embedding model.
The similarity score is computed using the Euclidean distance between vectors.
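As a rough illustration of the ranking described above, the sketch below sorts objects by the Euclidean distance between toy vectors. The vectors, object names, and the `rank_by_distance` helper are hypothetical. In the block itself the vectors come from the selected embedding model. How distances to several query terms are combined is not stated here, so the sketch assumes a single query vector and treats a smaller distance as a better rank.

```python
import math

def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_by_distance(source, query_vec):
    """Return object ids sorted from closest to farthest from query_vec.

    `source` maps an object id to its (hypothetical) embedding vector.
    """
    scored = [(euclidean(vec, query_vec), obj) for obj, vec in source.items()]
    scored.sort()  # smallest distance first (assumed to mean most similar)
    return [obj for _, obj in scored]

# Toy 3-dimensional vectors standing in for real embeddings (assumption).
source = {
    "obj_a": [0.9, 0.1, 0.0],
    "obj_b": [0.0, 1.0, 0.0],
    "obj_c": [0.8, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]

ranking = rank_by_distance(source, query)
print(ranking)  # ['obj_a', 'obj_c', 'obj_b']
```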
Input¶
SOURCE [OBJ,STRING]
: a 2-column input with an object-string pair. Typically obtained with the Extract string block
QTERMS [STRING]
: a list of keywords to rank SOURCE objects against
Output¶
RETRIEVE [OBJ]
: a list of ranked objects
Parameters¶
Embedding model
: the embedding model used for creating the vector embeddings
Pooling mode
: how the embedded tokens of each input string are combined into one vector
MEAN
: the average value of each dimension across all tokens is taken; captures the overall meaning
MAX
: the highest value for each dimension across all tokens is taken; highlights the most prominent features
Chunk size
: maximum number of characters that will be embedded as one vector. Strings longer than the chunk size will be split into multiple chunks.
Chunk overlap
: number of characters by which consecutive chunks overlap. This is intended to prevent information from being siloed into separate chunks.
Search type
: the method used for vector similarity search
EXACT
: computes the exact distance between each source and query vector; only recommended for a small number of source vectors (~100,000 or fewer)
HNSW
: computes the approximate distance between each source and query vector, based on the Hierarchical Navigable Small World algorithm.
K value
: the number of objects to retrieve when using an approximate Search type; greatly affects search time.
Index name
: name required for storing the graph-based indices used during approximate search; it needs to be unique per source data
Output scores can be normalised.
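To make the Pooling mode, Chunk size, and Chunk overlap parameters more concrete, here is a minimal sketch of both ideas. It assumes a plain character-based sliding window for chunking and per-dimension pooling over toy token vectors. The helper names `chunk_text` and `pool` are hypothetical, and the block's actual splitting and pooling logic may differ in detail.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into chunks of at most chunk_size characters,
    each sharing chunk_overlap characters with the previous chunk."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks

def pool(token_vectors, mode="MEAN"):
    """Combine per-token vectors into one vector, dimension by dimension."""
    columns = list(zip(*token_vectors))  # one tuple per dimension
    if mode == "MEAN":
        return [sum(col) / len(col) for col in columns]
    if mode == "MAX":
        return [max(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {mode}")

chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']

# Two toy token vectors with two dimensions each (assumption).
tokens = [[1.0, 4.0], [3.0, 0.0]]
print(pool(tokens, "MEAN"))  # [2.0, 2.0]
print(pool(tokens, "MAX"))   # [3.0, 4.0]
```

In this sketch each chunk repeats the last character of the previous one, which is the effect Chunk overlap aims for: text near a chunk boundary appears in both neighbouring chunks.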
Note: When using HNSW, if the SOURCE vectors are changed or updated, the index will not update automatically. Change Index name to create a new index and see the changes.