Introduction

Excel@2 provider lets you run complex SQL queries against data in MS Excel files.

Excel@2 works in a slightly different way than the other CAT providers. Because MS Excel is not a database, CAT first loads the MS Excel data into an in-memory database - for easier querying and testing. This “copy” of data in memory is removed when CAT process ends.

Define Data Source

In the Connection string, you provide full path to a .xlsx file. Older .xls format is NOT supported. Here is an example of a data source:

Data Sources:
- Name:               MyExcelData     # every data source must have a friendly name
  Provider:           Excel@2         # don't confuse with Excel@1 and ExcelOleDB@1
  Connection string:  ./2023.xlsx     # Relative or absolute file path.
                                      # In this case CAT expects the file "next to" your project file
  Sheets:             sheet1, sheet2  # optional, if you don't want to load all sheets
  Normalize column names: true        # optional, remove spaces and non-English characters from column names

In the example above, CAT will create two tables in in-memory database: "MyExcelData"."sheet1" and "MyExcelData"."sheet2". The naming convention is "DataSourceName"."SheetName".

Here is a list of all settings you can use:

Setting name Required Default Explanation
Name yes every data source must have a friendly name
Provider yes tells CAT to use this provider
Connection string yes absolute or relative path to your xlsx file (relative paths are resolved against the directory with your .cat.yaml project file)
Sheets no Optional comma or semicolon separated names of sheets. Case insensitive. All sheets are loaded if this is not specified. Error occurs when you specify a non-existing sheet name.
Normalize column names no false If true, cat normalizes column names (see below)
Normalize table names no false If true, cat normalizes table names (see below)
All varchar no false CAT does not guess data types and imports all data as VARCHARs
Headers no true If true, CAT considers the first row as headers, otherwise all is data

Sheets is now optional. If you don’t specify the setting, CAT will load all the sheets. Completely empty sheets (without even headers) are skipped. If you want to test only part of the workbook, specify the sheet names, CAT will then not have to load unnecessary data.

All varchar is for troubleshooting. CAT by default tries to guess correct data types. If CAT finds a value that does not “fit” to guessed data type, it changes the column to VARCHAR, it is designed to never fail to load the data. This way you can e.g., write tests for expected data types. You can use DuckDB’s TRY_CAST function to find values that break your rules.

Normalizing names

Some MS Excel worksheet names and headers contain characters that make querying them very inconvenient, especially when you need to write lots of SQL queries. CAT can normalize the names for you by removing all diacritics, replacing spaces with underscores and replacing consecutive underscores with one underscore.

Please note that you need to use original names of sheets in Sheets setting.

Using SQL

In our example, you can write this simple test:

Tests:
- Name:         Check at least something was generated
  Suite:        Smoke tests
  Description:  Sometimes happens the generated sheet is empty, this should alert us
  Data source:  MyExcelData # name of the data source
  Query:        SELECT * FROM "MyExcelData"."Sheet1" LIMIT 1
  Expectation:  set is not empty

Remember all worksheets from all data sources you specified are available as tables in the tests. Feel free to combine them using all kinds of joins or set operations like UNION, UNION ALL, EXCEPT, INTERSECT etc.

The data from sheets is loaded with types (numbers, dates, booleans, …). This is useful because you can also use all kind of functions - for working with date and time, strings, numbers. You can use pattern matching, regular expressions, …

SQL syntax supports almost all you can imagine: aggregate functions, common table expressions (CTEs), windowing functions, grouping sets, pivot and unpivot, corellated subqueries, …

See subsections (in the left menu) for details, such as SELECT syntax. For complete reference what constructs are supported in the queries, refer to this documentation:

https://duckdb.org/docs/stable/sql/query_syntax/select.

Limitations

This provider does NOT offer the “strike-through” feature (only Excel@1 supports this). Please be aware that lines with strike-through formatting will be processed as any other rows.

Prerequisites

Install Microsoft Visual C++ Redistributable

This is a very common prerequisite, used by many other softwares. So chances are good you already have it. Anyway, follow these instructions to verify it or install.

Go to this Microsoft webpage: Latest Microsoft Visual C++ Redistributable Version

Download and install “X64” version:

Microsoft Visual C++ Redistributable web download link

If you get a dialog with “Repair”, “Uninstall” and “Close” buttons, just close it - you already have this prerequisite. Otherwise install it (next, next, next…).

Internet connection

DuckDB loads the extensions from this URL: http://extensions.duckdb.org. This needs to be accessible, so ensure you are connected to the Internet and white-list this URL if needed.

If you need to be able to test CSV data when offline, please file a request using the “Get Help” button on the right on this site.

History

You may wonder why we have more drivers for MS Excel. Mainly for historical reasons, we support more ways to test MS Excel workbooks / worksheets data. This provider, Excel@2, aims to replace all the old ones completely. Both Excel@1 and ExcelOleDB@1 had their drawbacks. Those are addressed in this provider.

Excel@2:

  • Has only one minor prerequisite, usually already installed on many computers.

  • It allows you to use advanced SQL constructs in queries you run against data in MS Excel worksheets (such as joins, grouping, CTEs and it also comes with many functions etc.)

  • It is based on a maintained and well documented technology - DuckDB.