Contains
Contains expectation verifies one set contains all rows of another set.
Example
In this example we check all data from MySQL source system are present in a staging table in SQL server. The staging table in this example is persistent and is expected to contain more rows than the source (those that were deleted from the source already, but are kept in the staging table). The goal is to check, that every single row from source is present in the staging and we are not missing any rows.
Tests:
- Suite: Incremental load tests
Name: We have all customers from CRM
First Data Source: DWH # this data source is defined with SqlServer@1 provider
First Query: |
SELECT id, name
FROM StagingCrm.Customer
WHERE created_on < CAST(GETDATE() AS DATE)
ORDER BY id
Second Data Source: CRM # this data source is defined with MySql@1 provider
Second Query: |
SELECT id, name
FROM contacts.customer
WHERE created_on < curdate()
ORDER BY id
Expectation: contains
Key: id
Result
The test result is Passed
when:
-
The first set contains all rows from the second set. (The first set may contain also additional rows, missing in the second.)
-
The second set is empty (and there are no errors in the queries).
The test result is Failed
when the second set contains at least one row missing in the first set.
If the underlying provider returns an exception for any of the sets, the result will be Error
.
Remarks
Key and More Rows in the Error Message
When Maximum Errors Logged
is set to 1 (default) or 0, CAT closes the connection to the underlying provider immediately after it finds out there is a row int the second set that is missing in the first set. This is intentional, CAT is meant to be an “indicator” that points to problems, not a tool that returns all rows that do not satisfy a rule. You can raise Maximum Errors Logged
to a higher value. Maximum is 50 - it is enough to spot patterns in erroneous data and fix the problems.
Tests:
- Suite: Incremental load tests
Name: We have all customers from CRM
#
# abbreviated
#
Expectation: contains
Key: id # Optional, but highly recommended. Unique and sorted.
Maximum Errors Logged: 20 # works only if key is set
Maximum Errors Logged
setting to a value higher than one, you need to specify a key. In the Key
setting, provide the name of the column that is unique. The column name can be from either of the sets, but the values in it must be unique. Both sets need to be ordered by that column.Tests with Contains
expectation perform better and provide much better error messages when you set the Key
. If the key is not set, all still works, but you’ll only see the first erroneous row in the error message and the peformance might be lower.
Both key columns must have the same data type. Name can be different (use name from either first or second set), but the data type must be the same. So if you have e.g., int
in first set and bigint
in the second, cast both to bigint
.
Oposit direction
The Contains
expectation, by default, expects the superset to be in the first query and the subset to be in the second query. This may be sometimes awkward and unintuitive. You can override that by changing the expectation name:
Expectation: second contains first
You can also be more specific with the default “direction”. These are technically the same:
Expectation: first contains second
Expectation: contains