Sets Match - Using Key
How and Why to Specify a Key for Comparison
Why Comparison Key?
When you use sets match
expectation, you need to specify only the two queries and against what datasources they should be run. That’s it, nothing else. Both the sets need to be sorted (this is a job the database will do with the best performance).
CAT by default does the job and in fact, you don’t need to care. If you are happy with the error messages you get, this topic is of no interest for you.
Specifying a comparison key in fact provides only one additional benefit:
Without key, there is simply no way to recognize, whether the row is similar enough to be reported as different. When you don’t specify key, you only get report of
- rows missing in the first set
- rows missing in the second set
When you specify a comparison key, you get report of:
- rows missing in the first set
- rows missing in the second set
- different rows.
Remember the Key
is opitonal. It only gives you better error messages.
Where Do I Set the Comparison Key?
Just provide the Key
in YAML file, in the database table, in the Excel sheet - whereever you store your tests. Example of a test in a project file:
Tests:
- Name: Compare staging table with fact table - Passengers
Suite: Consistency checks
First Data Source: DWH
First Query: |
SELECT * FROM STAGE.PASSENGERS ORDER BY ID
Second Data Source: DWH
Second Query: |
SELECT * FROM FACT.PASSENGERS ORDER BY ID
Expectation: sets match
Key: ID
Maximum Errors Logged: 20
Both key columns must have the same data type. Name can be different (use name from either first or second set), but the data type must be the same. So if you have e.g., int
in first set and bigint
in the second, cast both to bigint
.
What the Difference Looks Like?
So OK, providing a key gives me better error message, but is it important?
It is up to you. Here you can see an error message without and with the key provided.
Key Was NOT Provided
ββββββ¬βββββ¬βββββββββ¬βββββββββββββββββ¬βββββ¬ββββββββ¬ββββββββββββββ¬βββββββββββββ¬ββββββββββββββ¬βββββββββββββ¬βββββββββββββββ¬ββββββββββββββββββββββ
β(1) β(2) βID βPASSPORT_NUMBER βAGE βGENDER βTICKET_PRICE βTICKET_TYPE βLUGGAGE_TYPE βNATIONALITY βFLIGHT_NUMBER βSYS_INSERTED_DT β
ββββββͺβββββͺβββββββββͺβββββββββββββββββͺβββββͺββββββββͺββββββββββββββͺβββββββββββββͺββββββββββββββͺβββββββββββββͺβββββββββββββββͺββββββββββββββββββββββ‘
β ! β β2185893 β118463324 β73 βFemale β109 β2 β4 βSlovak βXXX331759 β8/15/2023 2:16:23 PM β
β β ! β2185893 β128463324 β17 βMale β106 β1 β1 βBritish βXXX331759 β8/15/2023 2:16:23 PM β
β ! β β2453078 βK27742062 β42 βFemale β110 β3 β4 βCzech βXXX260037 β8/15/2023 2:16:34 PM β
β β ! β2453078 βK27742062 β42 βFemale β110 β3 β4 βCzech βXXX677662 β8/15/2023 2:16:34 PM β
β β ! β2888700 βA27834187 β35 βFemale β110 β3 β4 βCzech βXXX948757 β8/15/2023 2:16:50 PM β
β β ! β3048201 βP77272017 β39 βFemale β172 β3 β4 βSlovak βXXX594553 β8/15/2023 2:16:57 PM β
β ! β β3104872 βK21988419 β43 βFemale β247 β2 β1 βAustrian βXXX303521 β8/15/2023 2:16:59 PM β
β ! β β3435893 β705598821 β57 βFemale β140 β3 β3 βCzech βXXX697294 β8/15/2023 2:17:12 PM β
β β ! β3590747 β920512139 β44 βFemale β64 β3 β1 βCzech βXXX825049 β8/15/2023 2:17:17 PM β
β ! β β4164729 β521221195 β37 βMale β91 β3 β2 βBritish βXXX732170 β8/15/2023 2:17:39 PM β
β β ! β4164729 β521221195 β37 βMale β91 β3 β3 βBritish βXXX732170 β8/15/2023 2:17:39 PM β
ββββββͺβββββͺβββββββββͺβββββββββββββββββͺβββββͺββββββββͺββββββββββββββͺβββββββββββββͺββββββββββββββͺβββββββββββββͺβββββββββββββββͺββββββββββββββββββββββ‘
β(1) β(2) βID βPASSPORT_NUMBER βAGE βGENDER βTICKET_PRICE βTICKET_TYPE βLUGGAGE_TYPE βNATIONALITY βFLIGHT_NUMBER βSYS_INSERTED_DT β
ββββββ΄βββββ΄βββββββββ΄βββββββββββββββββ΄βββββ΄ββββββββ΄ββββββββββββββ΄βββββββββββββ΄ββββββββββββββ΄βββββββββββββ΄βββββββββββββββ΄ββββββββββββββββββββββ
Key WAS provided
ββββββ¬βββββ¬βββββββββ¬βββββββββββββββββ¬ββββββ¬ββββββββββ¬ββββββββββββββ¬βββββββββββββ¬ββββββββββββββ¬βββββββββββββ¬βββββββββββββββ
β(1) β(2) β(ID) βPASSPORT_NUMBER βAGE βGENDER βTICKET_PRICE βTICKET_TYPE βLUGGAGE_TYPE βNATIONALITY βFLIGHT_NUMBER β
ββββββͺβββββͺβββββββββͺβββββββββββββββββͺββββββͺββββββββββͺββββββββββββββͺβββββββββββββͺββββββββββββββͺβββββββββββββͺβββββββββββββββ‘
β < β β2185893 β! 118463324 β! 73 β! Female β! 109 β! 2 β! 4 β! Slovak βXXX331759 β
β β > β2185893 β! 128463324 β! 17 β! Male β! 106 β! 1 β! 1 β! British βXXX331759 β
β < β β2453078 βK27742062 β42 βFemale β110 β3 β4 βCzech β! XXX260037 β
β β > β2453078 βK27742062 β42 βFemale β110 β3 β4 βCzech β! XXX677662 β
β β ! β2888700 βA27834187 β35 βFemale β110 β3 β4 βCzech βXXX948757 β
β β ! β3048201 βP77272017 β39 βFemale β172 β3 β4 βSlovak βXXX594553 β
β ! β β3104872 βK21988419 β43 βFemale β247 β2 β1 βAustrian βXXX303521 β
β ! β β3435893 β705598821 β57 βFemale β140 β3 β3 βCzech βXXX697294 β
β β ! β3590747 β920512139 β44 βFemale β64 β3 β1 βCzech βXXX825049 β
β < β β4164729 β521221195 β37 βMale β91 β3 β! 2 βBritish βXXX732170 β
β β > β4164729 β521221195 β37 βMale β91 β3 β! 3 βBritish βXXX732170 β
β < β β4684428 β124528923 β36 βMale β127 β3 β2 βBritish β! XXX260038 β
β β > β4684428 β124528923 β36 βMale β127 β3 β2 βBritish β! XXX260039 β
β ! β β4890294 β225776255 β5 βMale β433 β2 β1 βCzech βXXX889446 β
ββββββͺβββββͺβββββββββͺβββββββββββββββββͺββββββͺββββββββββͺββββββββββββββͺβββββββββββββͺββββββββββββββͺβββββββββββββͺβββββββββββββββ‘
β(1) β(2) β(ID) βPASSPORT_NUMBER βAGE βGENDER βTICKET_PRICE βTICKET_TYPE βLUGGAGE_TYPE βNATIONALITY βFLIGHT_NUMBER β
ββββββ΄βββββ΄βββββββββ΄βββββββββββββββββ΄ββββββ΄ββββββββββ΄ββββββββββββββ΄βββββββββββββ΄ββββββββββββββ΄βββββββββββββ΄βββββββββββββββ
Remarks
Notice a few important things:
- Key column is always first in the table and is enclosed in brackets.
- When key is specified, only columns with difference are reported (notice the
SYS_INSERTED_DT
column is missing in the second example. - When key is specified, values with the difference have
!
prefix. - Different rows (with the same key, but at least one different column) are marked as
<
(values from first set) and>
(values from second set). - Because without the key specified rows are reported only as missing in first or missing in second, you get more errors with the key specified - even though the same
Maximum Errors Logged
setting was specified.