Matching Algorithms

This document describes Matching Algorithms.

Matching Algorithms contain a set of SQL-like rules that determine whether the DR Key Properties of two records indicate a match, i.e. that the source (new record being checked) and destination (existing record being compared) are the same Data Record.

Why Matching?

By default, if a matching algorithm is not specified, records are determined to match only if all of the values of their DR Key Properties are identical. Custom rules and/or score based matching algorithms are useful for identifying duplicate records in different systems where some data may not match completely.

Examples:

A system has an old address or last name for someone who has moved or gotten married.
Some systems may or may not have PII that would otherwise easily match two records. For example, if one system identifies a person by SSN, but not all other systems do, the matching algorithm could try alternate methods of matching the two records (name, date of birth, address, etc).
Systems may store values somewhat differently. For example, one system may run addresses through a verification system to get the exact spelling information the postal service desires, whereas another may accept hand entered data without validation.
etc

Overview

Matching algorithms follow a SQL-like pattern with groups of comparisons wrapped in parentheses and connected by logical AND and OR operators. YOUnite supports two similar types of matching algorithms: rules based and score based.

Rules based algorithms use a set of SQL-like rules to determine a match. If all rules evaluate to true, then the records match, otherwise they do not.

Score based algorithms also use a set of SQL-like rules to determine a match. Matched records are then given a "score" based on another set of criteria and, depending on the score, each record is determined to be a definite match, a possible match or not a match at all.

Note	Matching algorithms may reference any property in the list of DR Key Properties of the Data Domain. Properties not included in the DR Key Properties list cannot be used for matching.

Source and Destination Data Records

This document refers to "source" and "destination" in terms of whether one Data Record matches another Data Record.

The "source" is the new, inbound Data Record that triggered the matching algorithm. Existing Data Records that are a match or potential match are referred to as the "destination" Data Record.

Rules Based Matching Algorithms

Rules based algorithms use a set of SQL-like rules to determine a match. If all rules evaluate to true, then the records match, otherwise they do not.

Rules Based Matching Algorithm Examples

For the purposes of this guide, matching algorithms will be written as a property name followed by an operator and modifiers, combined with logical AND/OR operators and parentheses. The YOUnite UI also previews matching algorithms using this syntax.

Note that only properties included in the list of DR Key Properties for the Data Domain can be used in the matching algorithm.

Examples:

First name and last name must match:

firstName EQ AND lastName EQ

First name and last name must match (case insensitive):

firstName EQ CASE INSENSITIVE AND lastName EQ CASE INSENSITIVE

Case insensitive fuzzy match with a maximum Levenshtein distance of 5:

address EQ LEVENSHTEIN(5) CASE INSENSITIVE

User ID must match, or either the source or destination value is null:

userId EQ OR userId SRC_NULL OR userId DEST_NULL

Destination record must be active (active = true):

active DEST_CONSTANT "true"

Full example:

(userId EQ OR userId SRC_NULL OR userId DEST_NULL)
AND ((firstName EQ CASE INSENSITIVE AND lastName EQ CASE INSENSITIVE)
  OR (firstName EQ CASE INSENSITIVE AND address EQ LEVENSHTEIN(5) CASE INSENSITIVE)
  OR (firstName EQ CASE INSENSITIVE AND dateOfBirth EQ))
AND (active DEST_CONSTANT "true")

Score Based Matching Algorithms

Score based algorithms use the same type of criteria to match records as Rules Based Matching Algorithms above. Matched records are then given a "score" based on another set of criteria and, depending on the score, each record is determined to be a definite match, a possible match or not a match at all.

Note	The matching criteria is optional for a score based matching algorithm, but is recommended, particularly if any fuzzy matching is being used. See Performance Consideration Examples for more information on the performance implications.

Score Based Matching Algorithm Examples

The matching criteria for score based matching algorithms follow the same syntax as the rules based examples above. Separate criteria are then used to determine the score.

For the purposes of this guide, scoring algorithms will be written as IF, THEN, ELSE statements with the same operators as used for matching. An optional WHEN clause may be included to determine whether to evaluate the scoring algorithm step at all.

First, Last, Date of Birth

In this simple example, first name, last name and date of birth are considered. If two of the three match, the record is considered a "possible" match (20 points). If all three match the record is considered a "definite" match (30 points).

Matching criteria:

firstName EQ CASE INSENSITIVE OR lastName EQ CASE INSENSITIVE OR dateOfBirth EQ

Scoring criteria:

1. IF firstName EQ CASE INSENSITIVE THEN 10 ELSE 0
2. IF lastName EQ CASE INSENSITIVE THEN 10 ELSE 0
3. IF dateOfBirth EQ THEN 10 ELSE 0

Possible match = 20
Definite match = 30
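
The scoring above can be sketched in Python. This is only an illustration under stated assumptions, not YOUnite's implementation: the record dictionaries, the function names `eq`, `score_records` and `classify`, and the result labels are all hypothetical.

```python
def eq(src, dest, case_insensitive=False):
    """EQ comparison: false whenever either value is null (None)."""
    if src is None or dest is None:
        return False
    if case_insensitive:
        return str(src).lower() == str(dest).lower()
    return src == dest

# (property, case insensitive?, points awarded when the step is true)
SCORING_STEPS = [
    ("firstName", True, 10),
    ("lastName", True, 10),
    ("dateOfBirth", False, 10),
]

POSSIBLE_MATCH = 20
DEFINITE_MATCH = 30

def score_records(source, destination):
    """Add up the points of every scoring step that evaluates to true."""
    total = 0
    for field, case_insensitive, points in SCORING_STEPS:
        if eq(source.get(field), destination.get(field), case_insensitive):
            total += points
    return total

def classify(score):
    """Map a score onto the two thresholds (labels are hypothetical)."""
    if score >= DEFINITE_MATCH:
        return "DEFINITE"
    if score >= POSSIBLE_MATCH:
        return "POSSIBLE"
    return "NONE"

source = {"firstName": "ann", "lastName": "Lee", "dateOfBirth": "1990-01-01"}
destination = {"firstName": "Ann", "lastName": "Lee-Smith", "dateOfBirth": "1990-01-01"}
print(classify(score_records(source, destination)))
```

In this usage example the first name and date of birth match but the last name does not, so the score is 20 and the pair is classified as a possible match.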

See Example JSON for a more complicated example.

Ambiguous And Possible Matches

With both Rules Based and Score Based matching, ambiguous matches are possible.

For Rules Based matching, a match is considered ambiguous if the criteria matched to more than one Data Record.

For Score Based matching, a match is considered ambiguous if a match was given a score in the "possible" range, or if there was a "definite" match to two or more Data Records.

Ambiguous matches must be resolved through manual intervention by a user, or through a workflow. See Data Issues and Data Event Exceptions.

Matching Operators

The following operators can be used to match values.

In the API this corresponds to the comparisonType field.

Name	comparisonType Code	Description	Modifiers
Equal	EQ	Compare two properties for equality	Case insensitive, fuzzy matching (see below)
Not equal	NE	Compare two properties for inequality	Case insensitive, fuzzy matching (see below)
Source value is NULL	SRC_NULL	True if the source (new value) is null
Destination value is NULL	DEST_NULL	True if the destination (existing value) is null
Source is NOT NULL	SRC_NOT_NULL	True if the source (new value) is not null
Destination value is NOT NULL	DEST_NOT_NULL	True if the destination (existing value) is not null
Source value is a constant	SRC_CONSTANT	True if the source (new value) matches a constant value	Case insensitive
Destination value is a constant	DEST_CONSTANT	True if the destination (existing value) matches a constant value	Case insensitive
Source value is not equal to a constant	SRC_NE_CONSTANT	True if the source (new value) does not match a constant value	Case insensitive
Destination value is not equal to a constant	DEST_NE_CONSTANT	True if the destination (existing value) does not match a constant value	Case insensitive

Null Value Handling

Null values pose particular problems for matching and "equality" and it is important to understand their behavior.

YOUnite takes an approach to nulls similar to SQL databases: any operation performed on a null value will return false, even if both values are null. Special operators must be used instead to check if a value is null: SRC_NULL and DEST_NULL.

The following chart illustrates the true / false outcomes when nulls are present with various operators:

Operator	Source value is null only	Destination value is null only	Both values are null
EQ	false	false	false
NE	false	false	false
SRC_CONSTANT	false	true or false	false
DEST_CONSTANT	true or false	false	false
SRC_NE_CONSTANT	false	true or false	false
DEST_NE_CONSTANT	true or false	false	false
SRC_NULL	true	false	true
SRC_NOT_NULL	false	true	false
DEST_NULL	false	true	true
DEST_NOT_NULL	true	false	false

The following are some examples of recipes for null handling. For illustrative purposes, the name of the property in these examples will be "favoriteColor".

Match if both values match and both are not null (this is the default behavior):

favoriteColor EQ

Match if both values match or both are null:

favoriteColor EQ OR (favoriteColor SRC_NULL AND favoriteColor DEST_NULL)

Match if both values match or if either is null:

favoriteColor EQ OR favoriteColor SRC_NULL OR favoriteColor DEST_NULL
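
The three recipes can be checked against the null rules with a small Python sketch. It models null as `None`; the function names are hypothetical and the code only mirrors the behavior described in this section, not the actual implementation.

```python
def eq(src, dest):
    # EQ is false as soon as either side is null, even when both are null
    return src is not None and dest is not None and src == dest

def src_null(src, dest):
    return src is None

def dest_null(src, dest):
    return dest is None

def match_strict(src, dest):
    # favoriteColor EQ
    return eq(src, dest)

def match_or_both_null(src, dest):
    # favoriteColor EQ OR (favoriteColor SRC_NULL AND favoriteColor DEST_NULL)
    return eq(src, dest) or (src_null(src, dest) and dest_null(src, dest))

def match_or_either_null(src, dest):
    # favoriteColor EQ OR favoriteColor SRC_NULL OR favoriteColor DEST_NULL
    return eq(src, dest) or src_null(src, dest) or dest_null(src, dest)

cases = [("blue", "blue"), ("blue", "red"), (None, "red"), ("blue", None), (None, None)]
for src, dest in cases:
    print(src, dest,
          match_strict(src, dest),
          match_or_both_null(src, dest),
          match_or_either_null(src, dest))
```

Each printed row shows the source value, the destination value and the outcome of the three recipes in the order they are listed above.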

Fuzzy matching

Several fuzzy matching algorithms are available for partially matching string values.

Note	Case insensitivity and "not equal" operators can be used with any of these algorithms, although some are already case insensitive. When "not equal" is applied to a fuzzy matching algorithm it means a positive result returns false instead of true.

In the API, the fields fuzzyMatchType, fuzzyNum and fuzzyPct contain these values. fuzzyNum is for integer parameters and fuzzyPct is for floating point parameters.

Name	fuzzyMatchType Code	Description	Always Case Insensitive?	fuzzyNum parameter	fuzzyPct parameter
Levenshtein Distance	LEVENSHTEIN	Distance between two strings	No	Maximum distance	Maximum distance as a percentage of the length of the destination string
Trigram Similarity	TRIGRAM	Similarity of the trigrams of two strings	Yes		Minimum percentage of trigrams that match
Soundex	SOUNDEX	Soundex comparison	Yes	Optional. If specified, minimum number of soundex characters that must match (0 - 4).
Metaphone	METAPHONE	Metaphone comparison	Yes	Maximum length of metaphone produced (1 - 255)
Double Metaphone	DMETAPHONE	Double Metaphone comparison	Yes
Double Metaphone (alternate)	DMETAPHONE_ALT	Double Metaphone comparison with alternate value	Yes

Levenshtein Distance

The Levenshtein fuzzy matching algorithm compares two strings and returns the number of single-character edits (insertions, deletions or substitutions) that would need to be made to the destination value to match the source value. This is known as the Levenshtein distance.

For matching purposes, the Levenshtein method will return true if the distance between the two strings is less than or equal to a supplied parameter.

Another option is to supply a percentage which indicates the maximum percentage of difference compared to the total number of characters in the destination. Note that this indicates the percentage of difference not similarity!

Note	By default this operation is CASE SENSITIVE.

Example:

Destination = "Fuzzy string"
Source      = "Frizzy string"

Difference = 2 (the "r" and "i" in the second string must be changed / removed)

If the supplied parameter is 2 or more, the match will return true.

If instead a percentage is supplied, it would need to be about .17 (17%) or more (2 differences divided by
12 characters = 0.1666666666666....)
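
The comparison can be sketched in Python as follows. The distance function is the standard dynamic-programming algorithm; the wrapper `levenshtein_match` and its parameter names are hypothetical and simply mirror the fuzzyNum and fuzzyPct parameters described above.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def levenshtein_match(source, destination, fuzzy_num=None, fuzzy_pct=None,
                      case_insensitive=False):
    """True when the distance is within the supplied number or percentage."""
    if source is None or destination is None:
        return False
    if case_insensitive:
        source, destination = source.lower(), destination.lower()
    distance = levenshtein(source, destination)
    if fuzzy_num is not None:
        return distance <= fuzzy_num
    if fuzzy_pct is not None:
        if not destination:
            return distance == 0
        # percentage of difference relative to the destination length
        return distance / len(destination) <= fuzzy_pct
    raise ValueError("either fuzzy_num or fuzzy_pct is required")

print(levenshtein("Frizzy string", "Fuzzy string"))
print(levenshtein_match("Frizzy string", "Fuzzy string", fuzzy_pct=0.17))
```

For the example strings the distance is 2, so a maximum distance of 2 or a percentage of 0.17 both produce a match, while a maximum distance of 1 does not.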

Trigram Similarity

Trigrams are a list of all combinations of three consecutive characters in a string. In addition, the first character, the first two characters and the last two characters are also included in the list.

For example, the string trigram has the following trigrams:

t, tr, tri, rig, igr, gra, ram, am

For matching purposes, the similarity between the trigrams of two string values can be measured by dividing the number of trigrams the two strings have in common by the total number of unique trigrams across both strings. The supplied parameter for this operation will indicate the minimum similarity.

Note	This operation is always CASE INSENSITIVE.

Example:

Destination = "trigram"
Source      = "trgrato"

Destination trigrams = t, tr, tri, rig, igr, gra, ram, am
Source trigrams      = t, tr, trg, rgr, gra, rat, ato, to

Matching trigrams = t, tr, gra (3)
Unique trigrams = t, tr, tri, trg, rig, rgr, igr, gra, rat, ram, ato, am, to (13)

Similarity = 3 / 13 or 0.23076923 (about 23%)

Caveats:

Each distinct trigram value is included only a single time. For example, trigramtrigram and trigram have nearly the same trigram values, although the former has two additional values: amt and mtr. This may make trigrams less useful for longer strings and sentences.
Values with multiple words have the trigrams of each word calculated and included.
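
The trigram calculation described in this section can be sketched in Python. This is an illustration only: splitting words on whitespace and the function names `trigrams` and `trigram_similarity` are assumptions, and the real implementation may treat punctuation or padding differently.

```python
def trigrams(value):
    """Distinct trigrams of every word, plus the leading and trailing fragments."""
    result = set()
    for word in value.lower().split():
        result.add(word[:1])    # first character
        result.add(word[:2])    # first two characters
        result.add(word[-2:])   # last two characters
        for i in range(len(word) - 2):
            result.add(word[i:i + 3])
    return result

def trigram_similarity(source, destination):
    """Shared trigrams divided by all unique trigrams of both values."""
    src, dest = trigrams(source), trigrams(destination)
    union = src | dest
    if not union:
        return 0.0
    return len(src & dest) / len(union)

print(sorted(trigrams("trigram")))
print(trigram_similarity("trgrato", "trigram"))
```

A match would then be declared when the similarity is at least the supplied minimum, for example `trigram_similarity(source, destination) >= 0.2`.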

Soundex

Soundex is a system of matching similar sounding names. The resulting value is a 4-character representation of what the name sounds like.

For matching purposes, two names are determined to be equal if their soundex value matches. Optionally, a parameter can be supplied to indicate the minimum amount of similarity (number of characters that match).

Note	This operation is always CASE INSENSITIVE. It also reportedly is not good for non-English names.

Example

Destination = "Joan"
Source      = "Jane"

Both these names have the same soundex value and will match (J500).

Destination = "Jonathan"
Source     = "Jim"

These two names do not have the same soundex value (J535 vs J500), however, if a parameter of 2 is supplied,
they will be determined a match as two of the four characters are the same.
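
The examples above can be reproduced with a short Python sketch of the standard American Soundex rules. The helper names `soundex`, `soundex_difference` and `soundex_match` are hypothetical, and comparing the two codes position by position is an assumption about how the optional parameter is applied.

```python
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit

def soundex(name):
    """Four-character Soundex code: first letter plus up to three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(letter, "")  # vowels have no code and reset `previous`
        if code and code != previous:
            result += code
        previous = code
    return (result + "000")[:4]

def soundex_difference(a, b):
    """Number of positions (0 - 4) at which the two codes agree."""
    return sum(1 for x, y in zip(soundex(a), soundex(b)) if x == y)

def soundex_match(source, destination, fuzzy_num=None):
    if fuzzy_num is None:
        return soundex(source) == soundex(destination)
    return soundex_difference(source, destination) >= fuzzy_num

print(soundex("Joan"), soundex("Jane"))
print(soundex("Jonathan"), soundex("Jim"), soundex_match("Jim", "Jonathan", 2))
```

Without a parameter the two codes must be identical; with a parameter, the number of agreeing positions must reach the supplied minimum.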

Metaphone

Metaphone is similar to Soundex in that it constructs a string representation of what a word or words sounds like. Unlike soundex, it is capable of producing much longer "sounds like" strings, up to 255 characters, controlled by the supplied parameter.

Note	This operation is always CASE INSENSITIVE.

Example

Destination = "Metaphone Matching"
Source      = "Metafone Matchink"

When supplied with a maximum length of 10, both these strings produce a metaphone of "MTFNMTXNK" and therefore
are a match. Notice that only 9 characters were actually needed to represent each string.

Caveats

The supplied parameter specifies the maximum output length of the string. The actual value may be shorter.
This algorithm is only capable of handling strings up to 255 characters in length. If a string is longer, it will be truncated. This also means the parameter cannot be larger than 255.

Double Metaphone, Alternate Double Metaphone

Double Metaphone has two "sounds like" options: "primary" and "alternate". These two "sounds like" strings may or may not be the same.

Unlike Metaphone, this operation does not accept a parameter and produces an output of up to 4 characters.

Note	This operation is always CASE INSENSITIVE.

Destination = "Jonathan"
Source      = "Johnathon"

Using either the primary or alternate method, these two names match, with metaphones of "JN0N" and "ANTN" respectively.

API Specification

YOUnite provides an easy to use User Interface for defining matching algorithms. If instead the API is to be called directly, the syntax is as follows:

Note	`matchingAlgorithm` is part of the JSON payload for a POST or PUT on a Data Domain Version.

matchingAlgorithm

Matching algorithm specification

Property	Description	Type	Default Value	Required
description	Description	String	none	no
type	Type of matching algorithm. Allowed values: RULE_BASED or SCORE_BASED.	String	none	yes
possibleMatchScore	Number that indicates the minimum score to be considered a "possible" or "ambiguous" match	Integer	none	no*
definiteMatchScore	Number that indicates the minimum score to be considered a "definite" or "positive" match	Integer	none	no*
matchingGroup	Top level matching group	MatchingGroup	none	yes for RULE_BASED^
scoringGroup	List of specifications used to score a match	List of ScoringGroup	none	yes for SCORE_BASED

*SCORE_BASED matching algorithms must supply a possibleMatchScore and/or a definiteMatchScore. One may be null but not both.

^If a SCORE_BASED matching algorithm does not supply a matchingGroup, one will automatically be created by OR’ing all the groups in the scoringGroup together. This may or may not be desirable, therefore it is recommended that a matchingGroup be supplied for more complex SCORE_BASED matching algorithms to narrow down the results. See Performance Considerations.
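
The requirements in the table and footnotes above can be expressed as a small Python check. This is a sketch under assumptions: the function names are hypothetical, the key name `scoringGroup` is taken from the property table, and "OR'ing all the groups together" is interpreted here as OR'ing the matchCriteria of each scoring group.

```python
SCORING_KEY = "scoringGroup"  # assumption: name as written in the property table

def validate_matching_algorithm(algorithm):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    kind = algorithm.get("type")
    if kind not in ("RULE_BASED", "SCORE_BASED"):
        problems.append("type must be RULE_BASED or SCORE_BASED")
    if kind == "RULE_BASED" and not algorithm.get("matchingGroup"):
        problems.append("RULE_BASED requires a matchingGroup")
    if kind == "SCORE_BASED":
        if not algorithm.get(SCORING_KEY):
            problems.append("SCORE_BASED requires at least one scoring group")
        if (algorithm.get("possibleMatchScore") is None
                and algorithm.get("definiteMatchScore") is None):
            problems.append("SCORE_BASED requires possibleMatchScore and/or definiteMatchScore")
    return problems

def default_matching_group(scoring_groups):
    """Matching group built by OR'ing the matchCriteria of every scoring group."""
    return {"operator": "OR",
            "childGroups": [group["matchCriteria"] for group in scoring_groups]}

name_step = {"field": "lastName", "comparisonType": "EQ"}
algorithm = {
    "type": "SCORE_BASED",
    "definiteMatchScore": 10,
    SCORING_KEY: [{"matchCriteria": {"matchingSteps": [name_step]}, "matchScore": 10}],
}
print(validate_matching_algorithm(algorithm))
print(default_matching_group(algorithm[SCORING_KEY]))
```

Running the check on the client before issuing the POST or PUT gives earlier feedback than waiting for the server to reject the payload.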

matchingGroup

A Matching Group is a list of either matching steps or child matching groups. A single operator (AND or OR) is applied between each of these steps or child groups. A Matching Group must contain at least one matching step or one child matching group, but cannot contain both.

Property	Description	Type	Default Value	Required
matchingSteps	Matching steps in the group. When specified, each matching step is connected with the value specified in `operator`.	List of MatchingStep	none	no*
childGroups	Child matching groups. When specified, each child group is connected with the value specified in `operator`.	List of MatchingGroup	none	no*
operator	Operator to apply between child steps or groups. Valid values are: `AND` and `OR`.	String	none	yes if more than one child group or step

*Either matchingSteps or childGroups must be specified but not both.
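
The structural rules for a Matching Group lend themselves to a recursive check. The Python sketch below is hypothetical (the function name, the `path` argument and the message wording are not part of the API) and only encodes the constraints stated in this section.

```python
def validate_group(group, path="matchingGroup"):
    """Return a list of structural problems found in a matching group."""
    problems = []
    steps = group.get("matchingSteps") or []
    children = group.get("childGroups") or []
    operator = group.get("operator")
    if steps and children:
        problems.append(f"{path}: specify matchingSteps or childGroups, not both")
    if not steps and not children:
        problems.append(f"{path}: at least one matching step or child group is required")
    if operator is not None and operator not in ("AND", "OR"):
        problems.append(f"{path}: operator must be AND or OR")
    elif operator is None and len(steps) + len(children) > 1:
        problems.append(f"{path}: operator is required with more than one step or group")
    for index, child in enumerate(children):
        problems.extend(validate_group(child, f"{path}.childGroups[{index}]"))
    return problems

group = {
    "operator": "AND",
    "childGroups": [
        {"operator": "OR", "matchingSteps": [
            {"field": "userId", "comparisonType": "EQ"},
            {"field": "userId", "comparisonType": "SRC_NULL"},
        ]},
        {"matchingSteps": [
            {"field": "active", "comparisonType": "DEST_CONSTANT", "constantValue": "true"},
        ]},
    ],
}
print(validate_group(group))
```

The second child group has a single step, so it may omit `operator` without being reported.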

matchingStep

Step in a matching group. Each step produces a logical statement that will be evaluated to true or false.

Property	Description	Type	Default Value	Required
field	Property name	String	none	yes
comparisonType	Type of comparison to make. Valid values are: `EQ`, `NE`, `SRC_NULL`, `SRC_NOT_NULL`, `SRC_CONSTANT`, `SRC_NE_CONSTANT`, `DEST_NULL`, `DEST_NOT_NULL`, `DEST_CONSTANT`, `DEST_NE_CONSTANT`.	String	none	yes
constantValue	Constant value to use. Note that regardless of the data type of the property, this value must be expressed as a String. Type conversion will be performed automatically before matching occurs.	String	none	yes if `comparisonType` is for a constant
caseInsensitiveMatch	Perform a case insensitive match	Boolean	false	no
fuzzyMatchType	Fuzzy match type to perform. Only applies to a comparisonType of `EQ` or `NE`. Valid values are: `LEVENSHTEIN`, `TRIGRAM`, `SOUNDEX`, `METAPHONE`, `DMETAPHONE`, `DMETAPHONE_ALT`.	String	none	no
fuzzyNum	Integer parameter for fuzzy match types. Required for `METAPHONE`. Optional for `SOUNDEX`. Either fuzzyNum or fuzzyPct required for `LEVENSHTEIN`.	Integer	none	no
fuzzyPct	Floating point parameter for fuzzy match types. Required for `TRIGRAM`. Either fuzzyNum or fuzzyPct required for `LEVENSHTEIN`.	Double	none	no
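
To show how steps and groups combine, the following Python sketch evaluates a matching group against two records held as dictionaries. It is an illustration, not YOUnite's engine: the function names are hypothetical, fuzzy matching is left out, and constants are compared as text (booleans rendered as "true" / "false") rather than through real type conversion.

```python
NEGATED = {"NE", "SRC_NE_CONSTANT", "DEST_NE_CONSTANT"}
CONSTANT_KINDS = {"SRC_CONSTANT", "SRC_NE_CONSTANT", "DEST_CONSTANT", "DEST_NE_CONSTANT"}
NULL_CHECKS = {
    "SRC_NULL": lambda src, dest: src is None,
    "SRC_NOT_NULL": lambda src, dest: src is not None,
    "DEST_NULL": lambda src, dest: dest is None,
    "DEST_NOT_NULL": lambda src, dest: dest is not None,
}

def as_text(value):
    # Simplification: compare constants as text instead of converting types
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)

def evaluate_step(step, source, destination):
    if step.get("fuzzyMatchType"):
        raise NotImplementedError("fuzzy matching is outside this sketch")
    kind = step["comparisonType"]
    src, dest = source.get(step["field"]), destination.get(step["field"])
    if kind in NULL_CHECKS:
        return NULL_CHECKS[kind](src, dest)
    if kind in ("EQ", "NE"):
        left, right = src, dest
    elif kind in CONSTANT_KINDS:
        value = src if kind.startswith("SRC_") else dest
        left = None if value is None else as_text(value)
        right = step["constantValue"]
    else:
        raise ValueError(f"unknown comparisonType: {kind}")
    if left is None or right is None:
        return False  # comparisons involving a null value are always false
    if step.get("caseInsensitiveMatch") and isinstance(left, str) and isinstance(right, str):
        left, right = left.lower(), right.lower()
    return (left == right) != (kind in NEGATED)

def evaluate_group(group, source, destination):
    if group.get("matchingSteps"):
        results = [evaluate_step(s, source, destination) for s in group["matchingSteps"]]
    else:
        results = [evaluate_group(g, source, destination) for g in group.get("childGroups", [])]
    return any(results) if group.get("operator") == "OR" else all(results)

group = {"operator": "AND", "childGroups": [
    {"operator": "OR", "matchingSteps": [
        {"field": "userId", "comparisonType": "EQ"},
        {"field": "userId", "comparisonType": "SRC_NULL"},
        {"field": "userId", "comparisonType": "DEST_NULL"}]},
    {"matchingSteps": [
        {"field": "active", "comparisonType": "DEST_CONSTANT", "constantValue": "true"}]},
]}
print(evaluate_group(group, {"userId": None}, {"userId": "u1", "active": True}))
```

The usage example encodes `(userId EQ OR userId SRC_NULL OR userId DEST_NULL) AND (active DEST_CONSTANT "true")`; the source has no userId and the destination is active, so the group evaluates to true.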

scoringGroup

A specification of how to score a potential match.

Property	Description	Type	Default Value	Required
usageCriteria	Optional criteria to determine if this scoring group should be evaluated. If these criteria are present and evaluate to false, neither a positive nor a negative score will be applied to the overall result.	MatchingGroup	none	no
matchCriteria	Criteria to determine if this scoring group is a match or not	MatchingGroup	none	yes
matchScore	Score to apply if matchCriteria evaluates to true (value is added to overall result)	Integer	none	no
matchPenalty	Penalty to apply if matchCriteria evaluates to false (value is subtracted from overall result)	Integer	none	no
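
How usageCriteria, matchScore and matchPenalty interact can be sketched in Python. To keep the example short, the criteria are plain Python predicates instead of MatchingGroup structures; the helper names are hypothetical, and treating a missing matchScore or matchPenalty as 0 is an assumption.

```python
def both_present(field):
    """Usage criteria: the property is not null on either record."""
    return lambda src, dest: src.get(field) is not None and dest.get(field) is not None

def equal(field, case_insensitive=False):
    """Match criteria: EQ on one property, false if either value is null."""
    def check(src, dest):
        a, b = src.get(field), dest.get(field)
        if a is None or b is None:
            return False
        if case_insensitive:
            a, b = str(a).lower(), str(b).lower()
        return a == b
    return check

def total_score(scoring_groups, source, destination):
    total = 0
    for group in scoring_groups:
        usage = group.get("usageCriteria")
        if usage is not None and not usage(source, destination):
            continue  # group skipped: neither score nor penalty is applied
        if group["matchCriteria"](source, destination):
            total += group.get("matchScore") or 0
        else:
            total -= group.get("matchPenalty") or 0
    return total

scoring_groups = [
    {"usageCriteria": both_present("userId"), "matchCriteria": equal("userId"),
     "matchScore": 30, "matchPenalty": 30},
    {"matchCriteria": equal("firstName", case_insensitive=True),
     "matchScore": 20, "matchPenalty": 10},
]

destination = {"userId": "u1", "firstName": "Ann"}
print(total_score(scoring_groups, {"userId": None, "firstName": "ann"}, destination))
print(total_score(scoring_groups, {"userId": "u2", "firstName": "ann"}, destination))
```

In the first call the userId group is skipped because the source value is null, so only the first name contributes. In the second call both userId values are present but differ, so the penalty is subtracted.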

Example JSON

Below is example JSON for posting a Data Domain Version for this matching criteria:

(userId EQ OR userId SRC_NULL OR userId DEST_NULL)
AND ((firstName EQ CASE INSENSITIVE AND lastName EQ CASE INSENSITIVE)
  OR (firstName EQ CASE INSENSITIVE AND address EQ LEVENSHTEIN(5) CASE INSENSITIVE)
  OR (firstName EQ CASE INSENSITIVE AND dateOfBirth EQ))
AND (active DEST_CONSTANT "true")

And this scoring criteria:

1. IF userId EQ THEN 30 ELSE -30 WHEN (userId SRC_NOT_NULL AND userId DEST_NOT_NULL)
2. IF firstName EQ CASE INSENSITIVE THEN 20 ELSE -10
3. IF lastName EQ CASE INSENSITIVE THEN 20 ELSE -10
4. IF dateOfBirth EQ THEN 20 ELSE -10
5. IF address EQ LEVENSHTEIN(5) CASE INSENSITIVE THEN 20 ELSE -10

possible match = 20
definite match = 60

POST /domains/(uuid)/versions

{
    "modelSchema": {
        "properties": {
            "userId": {
                "type": "string"
            },
            "firstName": {
                "type": "string"
            },
            "lastName": {
                "type": "string"
            },
            "gender": {
                "type": "string"
            },
            "email": {
                "type": "string"
            },
            "dateOfBirth": {
                "type": "string"
            },
            "phone": {
                "type": "string"
            },
            "address": {
                "type": "string"
            },
            "active": {
                "type": "boolean"
            }
        }
    },
    "drKeyProperties": [
        "userId",
        "firstName",
        "lastName",
        "dateOfBirth",
        "address",
        "email",
        "phone",
        "active"
    ],
    "matchingAlgorithm": {
        "description": "An optional description for an example score based matching algorithm",
        "type": "SCORE_BASED",
        "possibleMatchScore": 20,
        "definiteMatchScore": 60,
        "matchingGroup": {
            "operator": "AND",
            "childGroups": [
                {
                    "operator": "OR",
                    "matchingSteps": [
                        {
                            "field": "userId",
                            "comparisonType": "EQ"
                        },
                        {
                            "field": "userId",
                            "comparisonType": "SRC_NULL"
                        },
                        {
                            "field": "userId",
                            "comparisonType": "DEST_NULL"
                        }
                    ]
                },
                {
                    "operator": "OR",
                    "childGroups": [
                        {
                            "operator": "AND",
                            "matchingSteps": [
                                {
                                    "field": "firstName",
                                    "comparisonType": "EQ",
                                    "caseInsensitiveMatch": true
                                },
                                {
                                    "field": "lastName",
                                    "comparisonType": "EQ",
                                    "caseInsensitiveMatch": true
                                }
                            ]
                        },
                        {
                            "operator": "AND",
                            "matchingSteps": [
                                {
                                    "field": "firstName",
                                    "comparisonType": "EQ",
                                    "caseInsensitiveMatch": true
                                },
                                {
                                    "field": "address",
                                    "comparisonType": "EQ",
                                    "caseInsensitiveMatch": true,
                                    "fuzzyMatchType": "LEVENSHTEIN",
                                    "fuzzyNum": 5
                                }
                            ]
                        },
                        {
                            "operator": "AND",
                            "matchingSteps": [
                                {
                                    "field": "firstName",
                                    "comparisonType": "EQ",
                                    "caseInsensitiveMatch": true
                                },
                                {
                                    "field": "dateOfBirth",
                                    "comparisonType": "EQ"
                                }
                            ]
                        }
                    ]
                },
                {
                    "matchingSteps": [
                        {
                            "field": "active",
                            "comparisonType": "DEST_CONSTANT",
                            "constantValue": "true"
                        }
                    ]
                }
            ]
        },
        "scoringGroups": [
            {
                "usageCriteria": {
                    "operator": "AND",
                    "matchingSteps": [
                        {
                            "field": "userId",
                            "comparisonType": "SRC_NOT_NULL"
                        },
                        {
                            "field": "userId",
                            "comparisonType": "DEST_NOT_NULL"
                        }
                    ]
                },
                "matchCriteria": {
                    "matchingSteps": [
                        {
                            "field": "userId",
                            "comparisonType": "EQ"
                        }
                    ]
                },
                "matchScore": 30,
                "matchPenalty": 30
            },
            {
                "matchCriteria": {
                    "matchingSteps": [
                        {
                            "field": "firstName",
                            "comparisonType": "EQ",
                            "caseInsensitiveMatch": true
                        }
                    ]
                },
                "matchScore": 20,
                "matchPenalty": 10
            },
            {
                "matchCriteria": {
                    "matchingSteps": [
                        {
                            "field": "lastName",
                            "comparisonType": "EQ",
                            "caseInsensitiveMatch": true
                        }
                    ]
                },
                "matchScore": 20,
                "matchPenalty": 10
            },
            {
                "matchCriteria": {
                    "matchingSteps": [
                        {
                            "field": "dateOfBirth",
                            "comparisonType": "EQ"
                        }
                    ]
                },
                "matchScore": 20,
                "matchPenalty": 10
            },
            {
                "matchCriteria": {
                    "matchingSteps": [
                        {
                            "field": "address",
                            "comparisonType": "EQ",
                            "caseInsensitiveMatch": true,
                            "fuzzyMatchType": "LEVENSHTEIN",
                            "fuzzyNum": 5
                        }
                    ]
                },
                "matchScore": 20,
                "matchPenalty": 10
            }
        ]
    }
}

Performance Considerations

YOUnite optimizes DR Key matching lookups by using indexes in the underlying database. However, only some operations can make use of these indexes. In general, fuzzy matching does NOT use indexing and neither do "NOT EQUAL" operations. Case insensitive matches do use indexing.

For small data sets, or data sets that see little activity, performance may not be critical. However, for large data sets, non-indexed lookups can cause slowdowns and bottlenecks.

For optimal performance, the matching criteria should make use of at least one indexed property to narrow the results, so that less performant operations (such as fuzzy matching) work on a smaller data set instead of on every single Data Record.
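The recommendation above can be sketched in a few lines of Python. This is only an illustration of the "narrow first, then fuzzy match" idea, not how YOUnite implements its lookups: the in-memory dictionary stands in for a database index, and the function names and sample records are hypothetical.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def build_index(records: list, field: str) -> dict:
    """Group records by an exact field value (stand-in for a database index)."""
    index = defaultdict(list)
    for record in records:
        index[record[field]].append(record)
    return index


def find_candidates(source: dict, index: dict, max_distance: int) -> list:
    """Narrow by exact dateOfBirth first, then fuzzy match on address."""
    narrowed = index.get(source["dateOfBirth"], [])
    return [
        record for record in narrowed
        if levenshtein(source["address"].lower(),
                       record["address"].lower()) <= max_distance
    ]


# Hypothetical sample data for illustration only.
records = [
    {"name": "a", "dateOfBirth": "1945-08-14", "address": "1 Main St"},
    {"name": "b", "dateOfBirth": "1945-08-14", "address": "1 Main Street"},
    {"name": "c", "dateOfBirth": "1951-07-21", "address": "1 Main St"},
]
index = build_index(records, "dateOfBirth")
source = {"dateOfBirth": "1945-08-14", "address": "1 Main St."}

print([r["name"] for r in find_candidates(source, index, max_distance=5)])
```

Only the records that share the source's dateOfBirth are passed to the edit-distance function, so the expensive comparison runs on two records here rather than on all three.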

Performance Consideration Examples

Bad. All matching is fuzzy.

dateOfBirth EQ LEVENSHTEIN(4) AND (firstName EQ SOUNDEX OR lastName EQ SOUNDEX)

Bad. Because the steps are OR’d together, a full scan of all records will still occur for firstName and lastName.

dateOfBirth EQ OR firstName EQ SOUNDEX OR lastName EQ SOUNDEX

Good. dateOfBirth will narrow the results before fuzzy matching is done on firstName and lastName.

dateOfBirth EQ AND (firstName EQ SOUNDEX OR lastName EQ SOUNDEX)

Good. Multiple options for an indexed match are tried first, before fuzzy logic is applied to the results.

(dateOfBirth EQ
    OR firstName EQ CASE INSENSITIVE
    OR lastName EQ CASE INSENSITIVE
    OR email EQ CASE INSENSITIVE
    OR phone EQ)
AND (firstName EQ SOUNDEX OR lastName EQ SOUNDEX)
AND (email EQ LEVENSHTEIN(5) CASE INSENSITIVE OR address EQ LEVENSHTEIN(5) CASE INSENSITIVE OR phone EQ)

Testing Matching Algorithms

An API endpoint exists to allow testing matching algorithms with mock data.

To test out a matching algorithm you will first need to:

Assign the matching algorithm to a Data Domain Version. Make note of the UUID of the Data Domain and the Data Domain Version.
Have at least one Adaptor defined in a Zone. Make note of the Zone and Adaptor UUIDs.
Typically, each mock data record would be present at a different adaptor. You can still test with only one adaptor, but the results will always show "AMBIGUOUS" as a DR can only exist

To test out the matching algorithm, issue the following command:

POST /zones/(zone-uuid)/adaptors/(adaptor-uuid)/matching-testing

Notes:

The zone-uuid and adaptor-uuid identify the zone and adaptor from which the mock data event being tested originates.

The body of the request will look something like this:

{
    "name": "one",
    "domainVersionUuid": "68b90673-e5d7-44d6-bb90-81276be9d944",
    "drKey": {
        "firstName": "Steve",
        "lastName": "Martin",
        "address": "1 Main St",
        "city": "Chico",
        "state": "CA",
        "zip": "95928"
    },
    "testDrKeys": [
        {
            "name": "two",
            "adaptorUuid": "fcccfbcf-5083-4289-bacf-e94bfc19ea36",
            "drKey": {
                "firstName": "STEVE",
                "lastName": "MARTIN",
                "address": "12 Lovers Lane",
                "city": "San Francisco",
                "state": "CA",
                "zip": "85744"
            }
        },
        {
            "name": "three",
            "adaptorUuid": "552cf0b5-1bc1-4120-9409-ed63c5134db4",
            "drKey": {
                "firstName": "Robin",
                "lastName": "Williams",
                "address": "321 Breezy Blvd",
                "city": "Portland",
                "state": "OR",
                "zip": "97217"
            }
        }
    ]
}
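As a minimal sketch, the request above could be assembled in Python with the standard library as shown below. The base URL and the zone UUID are placeholders, the helper name build_matching_test_request is hypothetical, and authentication is omitted because it depends on your deployment. Only the path and the body properties come from this document.

```python
import json
import urllib.request


def build_matching_test_request(base_url: str, zone_uuid: str,
                                adaptor_uuid: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request for the matching test endpoint."""
    url = (f"{base_url.rstrip('/')}/zones/{zone_uuid}"
           f"/adaptors/{adaptor_uuid}/matching-testing")
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Shortened version of the request body shown above.
body = {
    "name": "one",
    "domainVersionUuid": "68b90673-e5d7-44d6-bb90-81276be9d944",
    "drKey": {"firstName": "Steve", "lastName": "Martin"},
    "testDrKeys": [
        {
            "name": "two",
            "adaptorUuid": "fcccfbcf-5083-4289-bacf-e94bfc19ea36",
            "drKey": {"firstName": "STEVE", "lastName": "MARTIN"},
        }
    ],
}

request = build_matching_test_request(
    "https://younite.example.com/api",        # placeholder base URL
    "00000000-0000-0000-0000-000000000000",   # placeholder zone UUID
    "6e15d7a8-ba28-43e2-8fbf-349fb0d38964",
    body,
)
print(request.get_method(), request.full_url)
```

Passing the resulting object to urllib.request.urlopen would send the request. That step is left out here so the sketch can run without a live server.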

The result of this request will look something like this:

{
    "name": "one",
    "drKey": {
        "zip": "95928",
        "firstName": "Steve",
        "lastName": "Martin",
        "address": "1 Main St",
        "city": "Chico",
        "state": "CA"
    },
    "adaptor": {
        "uuid": "6e15d7a8-ba28-43e2-8fbf-349fb0d38964",
        "name": "Test Adaptor 1"
    },
    "matchStatus": "AMBIGUOUS",
    "matches": [
        {
            "name": "two",
            "adaptor": {
                "uuid": "fcccfbcf-5083-4289-bacf-e94bfc19ea36",
                "name": "Test Adaptor 2"
            },
            "drKey": {
                "zip": "85744",
                "firstName": "STEVE",
                "lastName": "MARTIN",
                "address": "12 Lovers Lane",
                "city": "San Francisco",
                "state": "CA"
            },
            "score": 40,
            "matchRating": "POSSIBLE_MATCH"
        }
    ],
    "unmatchedResults": [
        {
            "name": "three",
            "adaptor": {
                "uuid": "552cf0b5-1bc1-4120-9409-ed63c5134db4",
                "name": "Test Adaptor 3"
            },
            "drKey": {
                "zip": "97217",
                "firstName": "Robin",
                "lastName": "Williams",
                "address": "321 Breezy Blvd",
                "city": "Portland",
                "state": "OR"
            },
            "matchRating": "NO_MATCH"
        }
    ],
    "domainVersionUuid": "68b90673-e5d7-44d6-bb90-81276be9d944",
    "matchingAlgorithm": { ... }
}

Notes:

The name property is optional and may be used to help identify the test records.
matchRating will be one of the following: NO_MATCH, POSSIBLE_MATCH or DEFINITE_MATCH.
matchingTestResults indicate which records were identified as a potential match, even if the end result was NO_MATCH.
unmatchedTestResults indicate which records were not identified as a potential match.
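To make the result easier to read, the following sketch groups the test records by their matchRating. It assumes the property names shown in the sample response above ("matches", "unmatchedResults", "matchRating", "score"). The helper name summarize_test_result is hypothetical, and the sample data is a trimmed copy of that response.

```python
def summarize_test_result(result: dict) -> dict:
    """Group test records by matchRating as (name, score) pairs."""
    summary = {"NO_MATCH": [], "POSSIBLE_MATCH": [], "DEFINITE_MATCH": []}
    entries = result.get("matches", []) + result.get("unmatchedResults", [])
    for entry in entries:
        # score is absent for unmatched records in the sample, so it may be None.
        summary[entry["matchRating"]].append((entry.get("name"), entry.get("score")))
    return summary


# Trimmed copy of the sample response shown above.
sample_result = {
    "name": "one",
    "matchStatus": "AMBIGUOUS",
    "matches": [
        {"name": "two", "score": 40, "matchRating": "POSSIBLE_MATCH"},
    ],
    "unmatchedResults": [
        {"name": "three", "matchRating": "NO_MATCH"},
    ],
}

for rating, records in summarize_test_result(sample_result).items():
    print(rating, records)
```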