Contains

Contains expectation verifies one set contains all rows of another set.

Example

In this example we check all data from MySQL source system are present in a staging table in SQL server. The staging table in this example is persistent and is expected to contain more rows than the source (those that were deleted from the source already, but are kept in the staging table). The goal is to check, that every single row from source is present in the staging and we are not missing any rows.

Tests:
- Suite: Incremental load tests
  Name: We have all customers from CRM
  First Data Source: DWH # this data source is defined with SqlServer@1 provider
  First Query: |
    SELECT  id, name
    FROM    StagingCrm.Customer
    WHERE   created_on < CAST(GETDATE() AS DATE)
    ORDER BY id
  Second Data Source: CRM # this data source is defined with MySql@1 provider
  Second Query: |
    SELECT  id, name
    FROM    contacts.customer
    WHERE   created_on < curdate()
    ORDER BY id
  Expectation: contains
  Key: id

🐱

Important: both the sets need to be ordered. Why? Databases are optimized for getting data in order and it allows CAT to evaluate the expectation in very fast and efficient way.

Result

The test result is Passed when:

The first set contains all rows from the second set. (The first set may contain also additional rows, missing in the second.)
The second set is empty (and there are no errors in the queries).

The test result is Failed when the second set contains at least one row missing in the first set.

If the underlying provider returns an exception for any of the sets, the result will be Error.

Remarks

Key and More Rows in the Error Message

When Maximum Errors Logged is set to 1 (default) or 0, CAT closes the connection to the underlying provider immediately after it finds out there is a row int the second set that is missing in the first set. This is intentional, CAT is meant to be an “indicator” that points to problems, not a tool that returns all rows that do not satisfy a rule. You can raise Maximum Errors Logged to a higher value. Maximum is 50 - it is enough to spot patterns in erroneous data and fix the problems.

Tests:
- Suite: Incremental load tests
  Name: We have all customers from CRM
  #
  # abbreviated
  #
  Expectation: contains
  Key: id # Optional, but highly recommended. Unique and sorted.
  Maximum Errors Logged: 20 # works only if key is set

🐱

If you want to raise the Maximum Errors Logged setting to a value higher than one, you need to specify a key. In the Key setting, provide the name of the column that is unique. The column name can be from either of the sets, but the values in it must be unique. Both sets need to be ordered by that column.

Tests with Contains expectation perform better and provide much better error messages when you set the Key. If the key is not set, all still works, but you’ll only see the first erroneous row in the error message and the peformance might be lower.

Both key columns must have the same data type. Name can be different (use name from either first or second set), but the data type must be the same. So if you have e.g., int in first set and bigint in the second, cast both to bigint.

Oposit direction

The Contains expectation, by default, expects the superset to be in the first query and the subset to be in the second query. This may be sometimes awkward and unintuitive. You can override that by changing the expectation name:

Expectation: second contains first

You can also be more specific with the default “direction”. These are technically the same:

Expectation: first contains second
Expectation: contains

Contains

Example #

Result #

Remarks #

Key and More Rows in the Error Message #

Oposit direction #

Example

Result

Remarks

Key and More Rows in the Error Message

Oposit direction