Match by string

Description

Finds matches between the STRING columns of the two inputs, A and B. The comparison function is configurable: equal, contains, containsWholeWord, startsWith, endsWith, prefix, levenshtein (edit distance) or jaro-winkler. The result provides the matching pairs, as well as the items from each input that did not produce a match.

The optional input Cands [OBJ,OBJ] limits the matching to the candidate pairs it lists.

  • The first column of Cands corresponds to the first (OBJ) column of A.

  • The second column of Cands corresponds to the first (OBJ) column of B.

  • Scores are propagated to final matches.
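
The effect of Cands can be sketched as follows. This is a minimal illustration, not the actual implementation: the function name match_with_candidates is hypothetical, scores are assumed to be attached to the candidate pairs, and each OBJ is assumed to occur at most once per input.

```python
def match_with_candidates(a, b, cands, compare):
    """Compare only the pairs listed in cands (hypothetical helper).

    a, b:    lists of (obj, string) tuples
    cands:   dict mapping (a_obj, b_obj) to a score (assumed representation)
    compare: function (string_a, string_b) -> bool
    """
    a_strings = dict(a)  # assumes each OBJ occurs once in A
    b_strings = dict(b)  # assumes each OBJ occurs once in B
    result = []
    for (a_obj, b_obj), score in cands.items():
        # Skip candidate pairs that refer to objects missing from A or B.
        if a_obj not in a_strings or b_obj not in b_strings:
            continue
        if compare(a_strings[a_obj], b_strings[b_obj]):
            # The candidate's score is carried over to the match.
            result.append((a_obj, b_obj, score))
    return result


a = [("a1", "Amsterdam"), ("a2", "Utrecht")]
b = [("b1", "Amsterdam"), ("b2", "Utrecht")]
cands = {("a1", "b1"): 0.9, ("a2", "b1"): 0.4}
print(match_with_candidates(a, b, cands, lambda x, y: x == y))
```

In this sketch the pair (a2, b2) is never compared, even though both strings are "Utrecht", because it is not listed in Cands.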

Input

  • A [OBJ,STRING]: a list of candidates; the STRING column is used for the comparison and the OBJ column is returned in the result

  • Cands [OBJ,OBJ] (optional): candidate pairs; only pairs of objects from A and B that are listed in Cands will be matched

  • B [OBJ,STRING]: a list of candidates; the STRING column is used for the comparison and the OBJ column is returned in the result

Output

  • RESULT [OBJ,OBJ]: the matched objects from A and B

  • NOTA [OBJ]: the objects from A that did not match with an item from B

  • NOTB [OBJ]: the objects from B that did not match with an item from A
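
How the three outputs relate to each other can be sketched as below. The function name match_by_string and the nested-loop strategy are assumptions made for illustration only; the sketch compares every A with every B and ignores Cands and scores.

```python
def match_by_string(a, b, compare):
    """Return (result, nota, notb) for two lists of (obj, string) tuples."""
    # RESULT: every pair of objects whose strings satisfy the comparison.
    result = [(a_obj, b_obj)
              for a_obj, a_str in a
              for b_obj, b_str in b
              if compare(a_str, b_str)]
    matched_a = {a_obj for a_obj, _ in result}
    matched_b = {b_obj for _, b_obj in result}
    # NOTA / NOTB: objects that do not appear in any matched pair.
    nota = [a_obj for a_obj, _ in a if a_obj not in matched_a]
    notb = [b_obj for b_obj, _ in b if b_obj not in matched_b]
    return result, nota, notb


a = [("a1", "Amsterdam"), ("a2", "Utrecht"), ("a3", "Leiden")]
b = [("b1", "Utrecht"), ("b2", "Delft")]
result, nota, notb = match_by_string(a, b, lambda x, y: x == y)
print(result, nota, notb)
```

Under these assumptions, every object of A ends up either in at least one RESULT pair or in NOTA, and likewise for B and NOTB.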

Parameters

  • Comparison: Comparison function to use

    • equal: the strings must be equal

    • contains: the string in B must be contained in A

    • containsWholeWord: the string in B must be contained in A, as a whole word (only punctuation/spaces around)

    • startsWith: the string in A must start with the string in B

    • endsWith: the string in A must end with the string in B

    • prefix: strings in A and B share a prefix of a given length

    • levenshtein: the string in A may differ from the string in B by at most Max edit-distance edits (character insertions, deletions or substitutions).

    • jaro-winkler: the strings in A and B must have a Jaro-Winkler similarity score not smaller than Min similarity.

  • Case-sensitive: if set to false, upper/lower case is ignored

  • Exclude self-matches: if set to true, a match is not emitted when the objects in A and B are the same. Mostly useful when A and B come from the same source
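
The parameters above can be illustrated with a small sketch. All names here (make_comparison, match, prefix_length, max_distance) are hypothetical and chosen for the example. The whole-word test approximates "only punctuation/spaces around" with a regular expression, levenshtein uses the standard definition that also counts substitutions, and jaro-winkler is left out for brevity.

```python
import re


def levenshtein(s, t):
    """Standard edit distance: insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                 # delete cs
                            curr[j - 1] + 1,             # insert ct
                            prev[j - 1] + (cs != ct)))   # substitute
        prev = curr
    return prev[-1]


def whole_word(a, b):
    # b must not be directly preceded or followed by a word character.
    return re.search(r"(?<!\w)" + re.escape(b) + r"(?!\w)", a) is not None


def make_comparison(name, case_sensitive=True, prefix_length=3, max_distance=1):
    """Build a function (string_a, string_b) -> bool for a comparison name."""
    n = prefix_length
    tests = {
        "equal": lambda a, b: a == b,
        "contains": lambda a, b: b in a,
        "containsWholeWord": whole_word,
        "startsWith": lambda a, b: a.startswith(b),
        "endsWith": lambda a, b: a.endswith(b),
        "prefix": lambda a, b: len(a) >= n and len(b) >= n and a[:n] == b[:n],
        "levenshtein": lambda a, b: levenshtein(a, b) <= max_distance,
    }
    test = tests[name]
    if case_sensitive:
        return test
    # Case-insensitive: fold both strings before comparing.
    return lambda a, b: test(a.casefold(), b.casefold())


def match(a, b, compare, exclude_self_matches=False):
    """Pairs of objects whose strings match, optionally dropping a == b."""
    return [(a_obj, b_obj)
            for a_obj, a_str in a
            for b_obj, b_str in b
            if compare(a_str, b_str)
            and not (exclude_self_matches and a_obj == b_obj)]


people = [("p1", "Ann"), ("p2", "ann"), ("p3", "Bob")]
same_name = make_comparison("equal", case_sensitive=False)
print(match(people, people, same_name, exclude_self_matches=True))
```

In the usage example both inputs come from the same source, so without excluding self-matches every object would also match itself.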