Sets Match - Using Key
How and Why to Specify a Key for Comparison
Why Comparison Key?
When you use sets match expectation, you need to specify only the two queries and against what datasources they should be run. That’s it, nothing else. Both the sets need to be sorted (this is a job the database will do with the best performance).
CAT by default does the job and in fact, you don’t need to care. If you are happy with the error messages you get, this topic is of no interest for you.
Specifying a comparison key in fact provides only one additional benefit:
Without key, there is simply no way to recognize, whether the row is similar enough to be reported as different. When you don’t specify the key, you only get report of
- rows missing in the first set
- rows missing in the second set
When you specify a comparison key, you get report of:
- rows missing in the first set
- rows missing in the second set
- different rows.
Remember the Key is opitonal for sets match comparison. It only gives you better error messages.
Where Do I Set the Comparison Key?
Just provide the Key in YAML file, in the database table, in the Excel sheet - whereever you store your tests. Example of a test in a project file:
Tests:
- Name: Compare staging table with fact table - Passengers
Suite: Consistency checks
First Data Source: DWH
First Query: |
SELECT * FROM STAGE.PASSENGERS ORDER BY ID
Second Data Source: DWH
Second Query: |
SELECT * FROM FACT.PASSENGERS ORDER BY ID
Expectation: sets match
Key: ID
Maximum Errors Logged: 20
Instead of column name(s), you can use ordinal position of the column(s). If the ID in the example above is the first column, just provide 1, if second, provide 2 etc.
You can also tell CAT that the first column is the key. In that case you do not put the column name, but one of these constants: First is Key or The First Column is Key.
Requirements for Key column(s)
When using the keys, beware of three things:
βΌοΈ The sets need to be sorted based on the key, in ascending order. CAT expects both the sets are sorted, even though you provided the comparison key. Your database is best at sorting data, let it do it. CAT will then very efficiently compare the values, with no pressure on your computer’s resources. The sorting must be in ascending order. Sorting on both sides must use the same collation. We don’t recommend using columns with non-ASCII characters as key columns - if there are non-ASCII characters, binary collation is needed.
βΌοΈ CAT expects the key is unique.
βΌοΈ Key columns on both sides must have “similar enough” data types. If you specify column of bigint on one side and int on the other, it is OK. If one is datetime and other boolean, CAT will refuse to compare the sets.
OK, “similar enough data types” is not very clear. Here is what the algorithm considers acceptable. If types of both sides fit to exactly one of these bullets, you are OK:
- exactly same types
- all types of integers and decimal numbers
- date/time types
- types that based on their type or length can possibly be unique identifiers
- string values (such as
char,varchar, arrays of chars)
Names of key columns can be different. Use name from either first or second set when specifying the key.
If the rules for uniqueness or sort are not met, CAT issues a warning, which is part of the error message. If the rule for compatible data types in key columns is not met, CAT sets the test result to Error and stops its evaluation.
Composite Key
In database world, it is very common that rather then one column, a combination of two or more columns is unique. CAT supports that.
- You can define more columns in key by using names, use comma or semicolon as splitter:
TerminalID, GateID - You can use “number” of the column instead of name - first column is refered to as 1, second as 2 etc:
1, 2 - You can even combine the two approaches:
1, GateID - You can tell CAT that the all columns except the last one are key columns:
All Except Last,All Columns Except the Last One
The data types of the columns in the key may differ, but they must match with the other side (see the rule described above).
What the Difference Looks Like?
So OK, providing a key gives me better error message, but is it important?
It is up to you. Here you can see an error message without and with the key provided.
Key Was NOT Provided
ββββββ¬βββββ¬βββββββββ¬βββββββββββββββββ¬βββββ¬ββββββββ¬ββββββββββββββ¬βββββββββββββ¬ββββββββββββββ¬βββββββββββββ¬βββββββββββββββ¬ββββββββββββββββββββββ
β(1) β(2) βID βPASSPORT_NUMBER βAGE βGENDER βTICKET_PRICE βTICKET_TYPE βLUGGAGE_TYPE βNATIONALITY βFLIGHT_NUMBER βSYS_INSERTED_DT β
ββββββͺβββββͺβββββββββͺβββββββββββββββββͺβββββͺββββββββͺββββββββββββββͺβββββββββββββͺββββββββββββββͺβββββββββββββͺβββββββββββββββͺββββββββββββββββββββββ‘
β ! β β2185893 β118463324 β73 βFemale β109 β2 β4 βSlovak βXXX331759 β8/15/2023 2:16:23 PM β
β β ! β2185893 β128463324 β17 βMale β106 β1 β1 βBritish βXXX331759 β8/15/2023 2:16:23 PM β
β ! β β2453078 βK27742062 β42 βFemale β110 β3 β4 βCzech βXXX260037 β8/15/2023 2:16:34 PM β
β β ! β2453078 βK27742062 β42 βFemale β110 β3 β4 βCzech βXXX677662 β8/15/2023 2:16:34 PM β
β β ! β2888700 βA27834187 β35 βFemale β110 β3 β4 βCzech βXXX948757 β8/15/2023 2:16:50 PM β
β β ! β3048201 βP77272017 β39 βFemale β172 β3 β4 βSlovak βXXX594553 β8/15/2023 2:16:57 PM β
β ! β β3104872 βK21988419 β43 βFemale β247 β2 β1 βAustrian βXXX303521 β8/15/2023 2:16:59 PM β
β ! β β3435893 β705598821 β57 βFemale β140 β3 β3 βCzech βXXX697294 β8/15/2023 2:17:12 PM β
β β ! β3590747 β920512139 β44 βFemale β64 β3 β1 βCzech βXXX825049 β8/15/2023 2:17:17 PM β
β ! β β4164729 β521221195 β37 βMale β91 β3 β2 βBritish βXXX732170 β8/15/2023 2:17:39 PM β
β β ! β4164729 β521221195 β37 βMale β91 β3 β3 βBritish βXXX732170 β8/15/2023 2:17:39 PM β
ββββββͺβββββͺβββββββββͺβββββββββββββββββͺβββββͺββββββββͺββββββββββββββͺβββββββββββββͺββββββββββββββͺβββββββββββββͺβββββββββββββββͺββββββββββββββββββββββ‘
β(1) β(2) βID βPASSPORT_NUMBER βAGE βGENDER βTICKET_PRICE βTICKET_TYPE βLUGGAGE_TYPE βNATIONALITY βFLIGHT_NUMBER βSYS_INSERTED_DT β
ββββββ΄βββββ΄βββββββββ΄βββββββββββββββββ΄βββββ΄ββββββββ΄ββββββββββββββ΄βββββββββββββ΄ββββββββββββββ΄βββββββββββββ΄βββββββββββββββ΄ββββββββββββββββββββββ
Key WAS provided
ββββββ¬βββββ¬βββββββββ¬βββββββββββββββββ¬ββββββ¬ββββββββββ¬ββββββββββββββ¬βββββββββββββ¬ββββββββββββββ¬βββββββββββββ¬βββββββββββββββ
β(1) β(2) β(ID) βPASSPORT_NUMBER βAGE βGENDER βTICKET_PRICE βTICKET_TYPE βLUGGAGE_TYPE βNATIONALITY βFLIGHT_NUMBER β
ββββββͺβββββͺβββββββββͺβββββββββββββββββͺββββββͺββββββββββͺββββββββββββββͺβββββββββββββͺββββββββββββββͺβββββββββββββͺβββββββββββββββ‘
β < β β2185893 β! 118463324 β! 73 β! Female β! 109 β! 2 β! 4 β! Slovak βXXX331759 β
β β > β2185893 β! 128463324 β! 17 β! Male β! 106 β! 1 β! 1 β! British βXXX331759 β
β < β β2453078 βK27742062 β42 βFemale β110 β3 β4 βCzech β! XXX260037 β
β β > β2453078 βK27742062 β42 βFemale β110 β3 β4 βCzech β! XXX677662 β
β β ! β2888700 βA27834187 β35 βFemale β110 β3 β4 βCzech βXXX948757 β
β β ! β3048201 βP77272017 β39 βFemale β172 β3 β4 βSlovak βXXX594553 β
β ! β β3104872 βK21988419 β43 βFemale β247 β2 β1 βAustrian βXXX303521 β
β ! β β3435893 β705598821 β57 βFemale β140 β3 β3 βCzech βXXX697294 β
β β ! β3590747 β920512139 β44 βFemale β64 β3 β1 βCzech βXXX825049 β
β < β β4164729 β521221195 β37 βMale β91 β3 β! 2 βBritish βXXX732170 β
β β > β4164729 β521221195 β37 βMale β91 β3 β! 3 βBritish βXXX732170 β
β < β β4684428 β124528923 β36 βMale β127 β3 β2 βBritish β! XXX260038 β
β β > β4684428 β124528923 β36 βMale β127 β3 β2 βBritish β! XXX260039 β
β ! β β4890294 β225776255 β5 βMale β433 β2 β1 βCzech βXXX889446 β
ββββββͺβββββͺβββββββββͺβββββββββββββββββͺββββββͺββββββββββͺββββββββββββββͺβββββββββββββͺββββββββββββββͺβββββββββββββͺβββββββββββββββ‘
β(1) β(2) β(ID) βPASSPORT_NUMBER βAGE βGENDER βTICKET_PRICE βTICKET_TYPE βLUGGAGE_TYPE βNATIONALITY βFLIGHT_NUMBER β
ββββββ΄βββββ΄βββββββββ΄βββββββββββββββββ΄ββββββ΄ββββββββββ΄ββββββββββββββ΄βββββββββββββ΄ββββββββββββββ΄βββββββββββββ΄βββββββββββββββ
Remarks
Notice a few important things:
- Key columns always come first in the erroneous data sample and are enclosed in brackets. (Notice the
(ID)column in the example.) - When key is specified, only columns with difference are reported (Notice the
SYS_INSERTED_DTcolumn is missing in the second example.) - When key is specified, values with the difference have
!prefix. There is always one exclamation mark in each non-key column. (Notice the 73 vs 17 age for ID 2185893.) - Different rows (with the same key, but at least one different column) are marked as
<(values from first set) and>(values from second set). - Note that the number of rows found differs based on whether you specified the key or not. Even though the same
Maximum Errors Loggedsetting was specified. Without the key specified, same row with difference is counted as two (missing on left an missing on right), while with the key specified this counts as one difference.