Complete example

Complex example of a CAT project file

This example shows several data sources, several tests, how to generate tests from metadata, how to load test definitions from external sources, how to set up multiple outputs, how to configure CAT to run on multiple threads, and more.

Presenting many CAT features in a single file while keeping the example meaningful is challenging. Instead of focusing on whether the tests make sense, focus on the configuration possibilities this example demonstrates.

We’ve often received feedback that our documentation includes only short, focused snippets, which can make it hard for beginners to see how everything fits together. This example gives you the whole picture. Simply remove the parts that aren’t relevant to your project!

Data sources: # define one or more data sources
- Name: FlightsSystem # friendly name, you use this in test definitions to refer to this data source
  Provider: SqlServer@2 # see Providers in the docs
  Connection string: >
    Server=sql-aero-flights-prod.database.windows.net;
    Authentication=Active Directory Service Principal;
    Database=sqldb-aero-flights-prod;
    User Id=%FLIGHTS_TESTING_PRINCIPAL_ID%;
    Password=%FLIGHTS_TESTING_PRINCIPAL_SECRET%;
- Name: PassengersSystem
  Provider: Postgres@1
  Connection string: "%PASSENGERS_CONNECTION_STRING%"

Named sets: # here you can define metadata for generating tests
- Name: all gates # in this case we want to generate one test for each Gate we have
  Data source: AERO_PROD # must be name of an existing data source
  Query: |
    SELECT  GATE_ID, GATE_NUMBER
    FROM    DIM.GATES

# You can instruct CAT to load definitions of data sources or tests from other sources
Get list of data sources from:
# in this case we want to load data sources from the file ./DataSources/PROD.yaml (we use an environment variable to change the location at runtime)
- Provider: Yaml@1
  Connection string: "./DataSources/%ENVIRONMENT%.yaml"
  Query: /Data sources # this is the name of the top-level node in the YAML file

Get list of tests from:
# we can load tests from database, either from database table or procedure
- Provider: SqlServer@1
  Connection string: data source=localhost;integrated security=true;initial catalog=Testing
  Query: SELECT * FROM [Test] WHERE Skipped = 0
  # or you can get list of tests from a procedure:
  # Query: EXEC Testing.GetListOfTests @ENVIRONMENT='DEV'
# but we also want to load tests defined in MS Excel file:
- Provider: Excel@1
  Connection string: MyTests.xlsx # file in the same directory as your project file
  Query: SELECT * FROM [Sheet1]

# Tests and data sources can also be defined directly in your project file:
Tests:
- Test suite: Consistency checks # optional, for hierarchical categorization
  Test case: flights vs passengers # optional, for hierarchical categorization
  Order: 1 # this is metadata only, it does not influence anything
  Name: Number of passengers is consistent between systems # short friendly name
  # Description is optional but highly recommended:
  Description: >
    This verifies that the flight system passengers table
    does not significantly differ in count of rows from
    the passengers table in the passengers system.
  First Data Source: FlightsSystem
  First Query: |
      SELECT  COUNT(*)
      FROM    dbo.Passenger
      WHERE   DeletedFlag = 0
  Second Data source: PassengersSystem
  Second Query: |
      SELECT COUNT(*) FROM public.passenger
  Expectation: sets match # set is empty, set is not empty, set rowcount, sets match, contains
  Tolerance: 5.0 # this means 5.0 % (so it is exactly 5.0 when the base is 100)
  Tolerance mode: percent # you can also use 'absolute'
  Tags: DevOnly # comma or semicolon separated, optional

- Test suite: Migration to cloud checks
  Test case: Passengers data
  Name: Check important columns in passengers table
  Description: >
    Verifies the passengers data were completely migrated, including
    correct non-English characters etc.
  First data source: PassengersSystem
  First Query: |
      SELECT p.passenger_id, p.first_name, p.last_name, p.passport_number, p.date_of_birth
      FROM public.passenger as p
        LEFT OUTER JOIN public.passengers_do_not_migrate as dnm on p.passenger_id = dnm.passenger_id
      WHERE dnm.passenger_id IS NULL
      ORDER BY p.passenger_id
  Second data source: FlightsSystem
  Second Query: |
      SELECT PassengerID, FirstName, LastName, PassportNumber, DateOfBirth
      FROM Dim.Passenger
      WHERE IsActive = 1
      ORDER BY PassengerID
  Expectation: sets match
  Key: PassengerID # Not obligatory but recommended.
  Tags: Consistency checks

- Name: Update of table
  Test Suite: Timeliness Data Quality Dimension
  Description: >
    This is to check if the staging table, with
    the column LastProcessDay, was updated today
  Data Source: DWH
  Query: |
        SELECT TOP(5) *
        FROM Staging.Shipment
        WHERE LastProcessDay > CAST(GETDATE() AS DATE);
  Expectation: set is not empty
  Tags: Monitoring

- Name: Actual flights for gate %GATE_NUMBER% are loaded
  Test Suite: BusinessTests
  Description: Flights in FACT.DEPARTURES table for gate %GATE_NUMBER% are loaded.
  Data Source: AERO_PROD
  Query: |
    SELECT    *
    FROM      FACT.DEPARTURES AS d
    WHERE     d.DATE_OF_FLIGHT BETWEEN GETDATE() AND DATEADD(DAY, 14, GETDATE())
        AND d.GATE_ID = '%GATE_ID%'
  Expectation: set is not empty
  Metadata: all gates # this tells CAT this is a "template" for metadata driven tests
  # CAT will generate a new test for each row returned by query in named set 'all gates'
  # values like '%GATE_ID%' will be replaced by respective column value in each row.

- Name: There are no new errors
  Description: >
    Checks there are no new errors in the logging table.
  Data Source: DWH
  Query: |
        SELECT *
        FROM Audit.LogIngestError
        WHERE IsResolved = 0
  Expectation: set is empty
  Maximum errors logged: 50
  Log number of errors: true
  Tags: Monitoring

# short syntax for outputs:
# Output: xlsx, json, junit

Output:
- File: LatestTestResults.xlsx
  Format: xlsx
- File: TestResults/TestResults-{timestamp}.xml
  Format: junit
- Database: FlightsSystem # name of a defined data source
  Table: Tests.CatTestResults
  # or
  # Procedure: dbo.SaveCatTestResult


Threads: 2

You will find all the necessary details in our documentation. Still, here is a collection of brief tips:

  • You can define as many data sources as you need.

  • Both data sources use environment variables to hide secrets. The FlightsSystem data source hides only the service principal ID and secret; the PassengersSystem data source hides the entire connection string. CAT automatically picks up the values of the environment variables FLIGHTS_TESTING_PRINCIPAL_ID, FLIGHTS_TESTING_PRINCIPAL_SECRET and PASSENGERS_CONNECTION_STRING.
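
The %VAR% placeholder convention can be pictured as a plain substitution over the connection string. A minimal Python sketch of the idea (the function name and regex are illustrative, not CAT's actual implementation):

```python
import os
import re

def expand_env(text):
    """Replace every %NAME% placeholder with the value of the
    environment variable NAME (illustrative helper, not CAT's code)."""
    return re.sub(r"%(\w+)%", lambda m: os.environ[m.group(1)], text)

# Hypothetical value; the real one would come from your CI/CD secrets:
os.environ["PASSENGERS_CONNECTION_STRING"] = "Host=localhost;Database=passengers"
print(expand_env("%PASSENGERS_CONNECTION_STRING%"))
# prints: Host=localhost;Database=passengers
```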

  • Note the first test uses Test suite, Test case and Name. This is the hierarchy of tests in CAT: one suite contains multiple test cases, and one test case contains multiple tests (test names). Synonyms and various naming conventions are supported; for example, for suite you can use TEST_SUITE, test suite, suite etc. Order is also used: it has no influence on anything in CAT and serves only for identification purposes (it is pure metadata of a test). CAT enforces uniqueness on the combination of suite, test case, order and name (the so-called “test full name”).
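
The uniqueness rule can be read as a constraint on the combination of those four fields. A hypothetical Python illustration of the "test full name" (tuple form here; CAT's exact rendering may differ):

```python
def full_name(test):
    """The 'test full name': suite, test case, order and name combined
    (illustrative tuple form, not CAT's actual formatting)."""
    return (test.get("Test suite"), test.get("Test case"),
            test.get("Order"), test.get("Name"))

tests = [
    {"Test suite": "Consistency checks", "Test case": "flights vs passengers",
     "Order": 1, "Name": "Counts match"},
    {"Test suite": "Consistency checks", "Test case": "flights vs passengers",
     "Order": 2, "Name": "Counts match"},
]
names = [full_name(t) for t in tests]
# The differing Order values keep the two full names unique:
assert len(names) == len(set(names))
```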

  • The first test uses Description. It is not mandatory, but it is highly recommended to always use it. It should contain information about why the test exists, what it checks, and what to do if it fails.

  • The first test compares data from different systems. It counts rows in a table in an Azure SQL database and in a table in a local PostgreSQL database. The test passes if the difference is lower than 5 percent (the lower number is used as the base). When a tolerance is set, it applies to all numeric columns; this is for convenience. If you need different tolerances for different columns, split the test into multiple tests, each checking different columns with its own tolerance setting.
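
The "lower number as base" rule can be sketched in a few lines of Python (an illustration of the rule as described, not CAT's source):

```python
def within_tolerance(a, b, tolerance_percent):
    """True when the relative difference, measured against the lower
    of the two values, does not exceed the tolerance."""
    if a == b:
        return True
    base = min(a, b)
    if base == 0:
        return False  # any difference from zero exceeds a percentage tolerance
    return abs(a - b) / base * 100 <= tolerance_percent

print(within_tolerance(100, 105, 5.0))  # True: exactly 5 % of the base 100
print(within_tolerance(100, 106, 5.0))  # False: 6 % difference
```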

  • The second test compares data across systems. It checks important columns of the passengers tables. The sets match expectation requires the data to be sorted; note that both queries have an ORDER BY clause. CAT can compare data without knowing which column is a unique key. However, specifying a unique key using the Key property improves performance and makes error messages more helpful. The name of the key column can be taken from either the first or the second set, it does not matter. CAT does NOT do any mapping based on column names, so you can use different column names in the queries. CAT does, on the other hand, require the columns to come in the same order in both queries.
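
The positional, order-dependent comparison could look roughly like this (a simplified sketch; CAT's actual diffing is certainly richer):

```python
def compare_sorted(rows_a, rows_b, key_index=0):
    """Compare two ordered result sets row by row. Columns are matched
    by position, never by name; the key column (given by index here)
    only makes the error messages friendlier."""
    errors = []
    for row_a, row_b in zip(rows_a, rows_b):
        if row_a != row_b:
            errors.append(f"mismatch at key {row_a[key_index]}: {row_a} != {row_b}")
    if len(rows_a) != len(rows_b):
        errors.append(f"row counts differ: {len(rows_a)} vs {len(rows_b)}")
    return errors

a = [(1, "Ann"), (2, "Bob")]
b = [(1, "Ann"), (2, "Rob")]
print(compare_sorted(a, b))  # one mismatch, reported under key 2
```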

  • In the Output section we defined that we want the results in MS Excel format in the file LatestTestResults.xlsx, as a JUnit XML file (for Azure DevOps) in a specific directory with a timestamp in the name, and written to a database table, one row per test. If the table does not exist, CAT will attempt to create it; the schema must already exist.

  • The Threads setting means CAT will evaluate tests on two threads. There is one queue of ALL tests. Each thread picks up the next test from the queue, removes it, and evaluates it independently of the other threads. Each thread ends once it finds the queue empty.
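
The queue model described above can be sketched with a shared work queue and two workers (illustrative only, not the real scheduler):

```python
import queue
import threading

def run_tests(tests, evaluate, thread_count=2):
    """Put every test on one shared queue; each worker pops and
    evaluates tests until the queue is empty, then exits."""
    work = queue.Queue()
    for test in tests:
        work.put(test)

    def worker():
        while True:
            try:
                test = work.get_nowait()
            except queue.Empty:
                return  # queue drained, this thread ends
            evaluate(test)

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

results = []
run_tests(["t1", "t2", "t3"], results.append)
print(sorted(results))  # every test evaluated exactly once
```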

  • Note the Tags property. For each test, you can define one or more tags, separated by commas or semicolons. Tags are optional, you don’t have to specify them, and they can contain spaces. When running tests from your project, you can include or exclude tests with or without tags.
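
Tag-based selection can be imagined as a simple set filter. A hypothetical sketch (the function names are invented; check the docs for CAT's real selection options):

```python
def parse_tags(raw):
    """Split a Tags value on commas or semicolons; tags may contain spaces."""
    return {t.strip() for t in raw.replace(";", ",").split(",") if t.strip()}

def select(tests, include=None, exclude=None):
    """Keep tests carrying any included tag, drop tests carrying any excluded tag."""
    chosen = []
    for test in tests:
        tags = parse_tags(test.get("Tags", ""))
        if include and not (tags & set(include)):
            continue
        if exclude and (tags & set(exclude)):
            continue
        chosen.append(test["Name"])
    return chosen

tests = [{"Name": "a", "Tags": "Monitoring"},
         {"Name": "b", "Tags": "DevOnly; Monitoring"},
         {"Name": "c", "Tags": ""}]
print(select(tests, include=["Monitoring"], exclude=["DevOnly"]))  # ['a']
```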

  • The second-to-last test is metadata driven. CAT executes the query defined in the named set ‘all gates’ and creates a test for each returned row. Every value between percent signs is replaced with the value of the same-named column in the respective row (if such a column exists).
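
The generation step amounts to cloning the template once per metadata row and substituting placeholders. A rough Python illustration (not CAT's implementation):

```python
import re

def generate_tests(template, rows):
    """For each metadata row, clone the template and replace every
    %COLUMN% placeholder with that row's value; placeholders without
    a matching column are left untouched."""
    generated = []
    for row in rows:
        test = {field: re.sub(r"%(\w+)%",
                              lambda m: str(row.get(m.group(1), m.group(0))),
                              value)
                for field, value in template.items()}
        generated.append(test)
    return generated

template = {"Name": "Actual flights for gate %GATE_NUMBER% are loaded",
            "Query": "... WHERE d.GATE_ID = '%GATE_ID%'"}
rows = [{"GATE_ID": 7, "GATE_NUMBER": "A1"}, {"GATE_ID": 9, "GATE_NUMBER": "B2"}]
for t in generate_tests(template, rows):
    print(t["Name"])
# Actual flights for gate A1 are loaded
# Actual flights for gate B2 are loaded
```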

  • The last test uses the set is empty expectation. Note we used Maximum errors logged to tweak how many rows we get in the sample of erroneous rows (1 by default, 50 at most, 0 for no data in the message). That works for all expectations. There is one more setting that works only for the set is empty expectation: Log number of errors. It enriches the output with the total number of errors (just a number). The difference: how many erroneous rows you see in the message (Maximum errors logged) versus how many errors exist (Log number of errors).
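
The interplay of the two settings can be pictured with a sketch of how a failure message might be assembled (the message format is invented for illustration; CAT's real output differs):

```python
def set_is_empty_result(rows, maximum_errors_logged=1, log_number_of_errors=False):
    """Build the result for a 'set is empty' expectation: sample at
    most N offending rows, optionally append the total error count."""
    if not rows:
        return "PASS"
    sample = rows[:maximum_errors_logged]
    message = f"FAIL, sample: {sample}"
    if log_number_of_errors:
        message += f", total errors: {len(rows)}"
    return message

errors = [("err-1",), ("err-2",), ("err-3",)]
print(set_is_empty_result(errors, maximum_errors_logged=2, log_number_of_errors=True))
# prints: FAIL, sample: [('err-1',), ('err-2',)], total errors: 3
```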

  • The example also demonstrates how you can retrieve test definitions from other sources, such as a relational database table or procedure, or an MS Excel sheet. Whatever data source you are able to test with CAT can also store test definitions for you. CAT expects the same column names (in a database table/procedure, MS Excel sheet, CSV or anything else) as the ones you are used to from YAML.