Get Help

Sets Match - Using Key

How and Why to Specify a Key for Comparison

Why Comparison Key?

When you use sets match expectation, you need to specify only the two queries and against what datasources they should be run. That’s it, nothing else. Both the sets need to be sorted (this is a job the database will do with the best performance).

CAT by default does the job and in fact, you don’t need to care. If you are happy with the error messages you get, this topic is of no interest for you.

Specifying a comparison key in fact provides only one additional benefit:

Without key, there is simply no way to recognize, whether the row is similar enough to be reported as different. When you don’t specify the key, you only get report of

  • rows missing in the first set
  • rows missing in the second set

When you specify a comparison key, you get report of:

  • rows missing in the first set
  • rows missing in the second set
  • different rows.

Remember the Key is opitonal for sets match comparison. It only gives you better error messages.

Where Do I Set the Comparison Key?

Just provide the Key in YAML file, in the database table, in the Excel sheet - whereever you store your tests. Example of a test in a project file:

Tests:
- Name: Compare staging table with fact table - Passengers
  Suite: Consistency checks
  First Data Source: DWH
  First Query: |
     SELECT  * FROM STAGE.PASSENGERS ORDER BY ID
  Second Data Source: DWH
  Second Query: |
     SELECT  * FROM FACT.PASSENGERS ORDER BY ID
  Expectation:  sets match
  Key: ID
  Maximum Errors Logged: 20

Instead of column name(s), you can use ordinal position of the column(s). If the ID in the example above is the first column, just provide 1, if second, provide 2 etc.

You can also tell CAT that the first column is the key. In that case you do not put the column name, but one of these constants: First is Key or The First Column is Key.

Requirements for Key column(s)

When using the keys, beware of three things:

‼️ The sets need to be sorted based on the key, in ascending order. CAT expects both the sets are sorted, even though you provided the comparison key. Your database is best at sorting data, let it do it. CAT will then very efficiently compare the values, with no pressure on your computer’s resources. The sorting must be in ascending order. Sorting on both sides must use the same collation. We don’t recommend using columns with non-ASCII characters as key columns - if there are non-ASCII characters, binary collation is needed.

‼️ CAT expects the key is unique.

‼️ Key columns on both sides must have “similar enough” data types. If you specify column of bigint on one side and int on the other, it is OK. If one is datetime and other boolean, CAT will refuse to compare the sets.

OK, “similar enough data types” is not very clear. Here is what the algorithm considers acceptable. If types of both sides fit to exactly one of these bullets, you are OK:

  • exactly same types
  • all types of integers and decimal numbers
  • date/time types
  • types that based on their type or length can possibly be unique identifiers
  • string values (such as char, varchar, arrays of chars)

Names of key columns can be different. Use name from either first or second set when specifying the key.

If the rules for uniqueness or sort are not met, CAT issues a warning, which is part of the error message. If the rule for compatible data types in key columns is not met, CAT sets the test result to Error and stops its evaluation.

Composite Key

In database world, it is very common that rather then one column, a combination of two or more columns is unique. CAT supports that.

  • You can define more columns in key by using names, use comma or semicolon as splitter: TerminalID, GateID
  • You can use “number” of the column instead of name - first column is refered to as 1, second as 2 etc: 1, 2
  • You can even combine the two approaches: 1, GateID
  • You can tell CAT that the all columns except the last one are key columns: All Except Last, All Columns Except the Last One

The data types of the columns in the key may differ, but they must match with the other side (see the rule described above).

What the Difference Looks Like?

So OK, providing a key gives me better error message, but is it important?

It is up to you. Here you can see an error message without and with the key provided.

Key Was NOT Provided

β”Œβ”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚(1) β”‚(2) β”‚ID      β”‚PASSPORT_NUMBER β”‚AGE β”‚GENDER β”‚TICKET_PRICE β”‚TICKET_TYPE β”‚LUGGAGE_TYPE β”‚NATIONALITY β”‚FLIGHT_NUMBER β”‚SYS_INSERTED_DT      β”‚
β•žβ•β•β•β•β•ͺ════β•ͺ════════β•ͺ════════════════β•ͺ════β•ͺ═══════β•ͺ═════════════β•ͺ════════════β•ͺ═════════════β•ͺ════════════β•ͺ══════════════β•ͺ═════════════════════║
β”‚ !  β”‚    β”‚2185893 β”‚118463324       β”‚73  β”‚Female β”‚109          β”‚2           β”‚4            β”‚Slovak      β”‚XXX331759     β”‚8/15/2023 2:16:23 PM β”‚
β”‚    β”‚ !  β”‚2185893 β”‚128463324       β”‚17  β”‚Male   β”‚106          β”‚1           β”‚1            β”‚British     β”‚XXX331759     β”‚8/15/2023 2:16:23 PM β”‚
β”‚ !  β”‚    β”‚2453078 β”‚K27742062       β”‚42  β”‚Female β”‚110          β”‚3           β”‚4            β”‚Czech       β”‚XXX260037     β”‚8/15/2023 2:16:34 PM β”‚
β”‚    β”‚ !  β”‚2453078 β”‚K27742062       β”‚42  β”‚Female β”‚110          β”‚3           β”‚4            β”‚Czech       β”‚XXX677662     β”‚8/15/2023 2:16:34 PM β”‚
β”‚    β”‚ !  β”‚2888700 β”‚A27834187       β”‚35  β”‚Female β”‚110          β”‚3           β”‚4            β”‚Czech       β”‚XXX948757     β”‚8/15/2023 2:16:50 PM β”‚
β”‚    β”‚ !  β”‚3048201 β”‚P77272017       β”‚39  β”‚Female β”‚172          β”‚3           β”‚4            β”‚Slovak      β”‚XXX594553     β”‚8/15/2023 2:16:57 PM β”‚
β”‚ !  β”‚    β”‚3104872 β”‚K21988419       β”‚43  β”‚Female β”‚247          β”‚2           β”‚1            β”‚Austrian    β”‚XXX303521     β”‚8/15/2023 2:16:59 PM β”‚
β”‚ !  β”‚    β”‚3435893 β”‚705598821       β”‚57  β”‚Female β”‚140          β”‚3           β”‚3            β”‚Czech       β”‚XXX697294     β”‚8/15/2023 2:17:12 PM β”‚
β”‚    β”‚ !  β”‚3590747 β”‚920512139       β”‚44  β”‚Female β”‚64           β”‚3           β”‚1            β”‚Czech       β”‚XXX825049     β”‚8/15/2023 2:17:17 PM β”‚
β”‚ !  β”‚    β”‚4164729 β”‚521221195       β”‚37  β”‚Male   β”‚91           β”‚3           β”‚2            β”‚British     β”‚XXX732170     β”‚8/15/2023 2:17:39 PM β”‚
β”‚    β”‚ !  β”‚4164729 β”‚521221195       β”‚37  β”‚Male   β”‚91           β”‚3           β”‚3            β”‚British     β”‚XXX732170     β”‚8/15/2023 2:17:39 PM β”‚
β•žβ•β•β•β•β•ͺ════β•ͺ════════β•ͺ════════════════β•ͺ════β•ͺ═══════β•ͺ═════════════β•ͺ════════════β•ͺ═════════════β•ͺ════════════β•ͺ══════════════β•ͺ═════════════════════║
β”‚(1) β”‚(2) β”‚ID      β”‚PASSPORT_NUMBER β”‚AGE β”‚GENDER β”‚TICKET_PRICE β”‚TICKET_TYPE β”‚LUGGAGE_TYPE β”‚NATIONALITY β”‚FLIGHT_NUMBER β”‚SYS_INSERTED_DT      β”‚
β””β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Key WAS provided

β”Œβ”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚(1) β”‚(2) β”‚(ID)    β”‚PASSPORT_NUMBER β”‚AGE  β”‚GENDER   β”‚TICKET_PRICE β”‚TICKET_TYPE β”‚LUGGAGE_TYPE β”‚NATIONALITY β”‚FLIGHT_NUMBER β”‚
β•žβ•β•β•β•β•ͺ════β•ͺ════════β•ͺ════════════════β•ͺ═════β•ͺ═════════β•ͺ═════════════β•ͺ════════════β•ͺ═════════════β•ͺ════════════β•ͺ══════════════║
β”‚ <  β”‚    β”‚2185893 β”‚! 118463324     β”‚! 73 β”‚! Female β”‚! 109        β”‚! 2         β”‚! 4          β”‚! Slovak    β”‚XXX331759     β”‚
β”‚    β”‚ >  β”‚2185893 β”‚! 128463324     β”‚! 17 β”‚! Male   β”‚! 106        β”‚! 1         β”‚! 1          β”‚! British   β”‚XXX331759     β”‚
β”‚ <  β”‚    β”‚2453078 β”‚K27742062       β”‚42   β”‚Female   β”‚110          β”‚3           β”‚4            β”‚Czech       β”‚! XXX260037   β”‚
β”‚    β”‚ >  β”‚2453078 β”‚K27742062       β”‚42   β”‚Female   β”‚110          β”‚3           β”‚4            β”‚Czech       β”‚! XXX677662   β”‚
β”‚    β”‚ !  β”‚2888700 β”‚A27834187       β”‚35   β”‚Female   β”‚110          β”‚3           β”‚4            β”‚Czech       β”‚XXX948757     β”‚
β”‚    β”‚ !  β”‚3048201 β”‚P77272017       β”‚39   β”‚Female   β”‚172          β”‚3           β”‚4            β”‚Slovak      β”‚XXX594553     β”‚
β”‚ !  β”‚    β”‚3104872 β”‚K21988419       β”‚43   β”‚Female   β”‚247          β”‚2           β”‚1            β”‚Austrian    β”‚XXX303521     β”‚
β”‚ !  β”‚    β”‚3435893 β”‚705598821       β”‚57   β”‚Female   β”‚140          β”‚3           β”‚3            β”‚Czech       β”‚XXX697294     β”‚
β”‚    β”‚ !  β”‚3590747 β”‚920512139       β”‚44   β”‚Female   β”‚64           β”‚3           β”‚1            β”‚Czech       β”‚XXX825049     β”‚
β”‚ <  β”‚    β”‚4164729 β”‚521221195       β”‚37   β”‚Male     β”‚91           β”‚3           β”‚! 2          β”‚British     β”‚XXX732170     β”‚
β”‚    β”‚ >  β”‚4164729 β”‚521221195       β”‚37   β”‚Male     β”‚91           β”‚3           β”‚! 3          β”‚British     β”‚XXX732170     β”‚
β”‚ <  β”‚    β”‚4684428 β”‚124528923       β”‚36   β”‚Male     β”‚127          β”‚3           β”‚2            β”‚British     β”‚! XXX260038   β”‚
β”‚    β”‚ >  β”‚4684428 β”‚124528923       β”‚36   β”‚Male     β”‚127          β”‚3           β”‚2            β”‚British     β”‚! XXX260039   β”‚
β”‚ !  β”‚    β”‚4890294 β”‚225776255       β”‚5    β”‚Male     β”‚433          β”‚2           β”‚1            β”‚Czech       β”‚XXX889446     β”‚
β•žβ•β•β•β•β•ͺ════β•ͺ════════β•ͺ════════════════β•ͺ═════β•ͺ═════════β•ͺ═════════════β•ͺ════════════β•ͺ═════════════β•ͺ════════════β•ͺ══════════════║
β”‚(1) β”‚(2) β”‚(ID)    β”‚PASSPORT_NUMBER β”‚AGE  β”‚GENDER   β”‚TICKET_PRICE β”‚TICKET_TYPE β”‚LUGGAGE_TYPE β”‚NATIONALITY β”‚FLIGHT_NUMBER β”‚
β””β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Remarks

Notice a few important things:

  • Key columns always come first in the erroneous data sample and are enclosed in brackets. (Notice the (ID) column in the example.)
  • When key is specified, only columns with difference are reported (Notice the SYS_INSERTED_DT column is missing in the second example.)
  • When key is specified, values with the difference have ! prefix. There is always one exclamation mark in each non-key column. (Notice the 73 vs 17 age for ID 2185893.)
  • Different rows (with the same key, but at least one different column) are marked as < (values from first set) and > (values from second set).
  • Note that the number of rows found differs based on whether you specified the key or not. Even though the same Maximum Errors Logged setting was specified. Without the key specified, same row with difference is counted as two (missing on left an missing on right), while with the key specified this counts as one difference.