Data quality is critical in any modern energy organization. Poor data quality directly correlates with poor decisions (garbage in, garbage out, as they say), a problem that becomes even more apparent as AI adoption increases and automated decisions are made continuously and at scale.
Data quality is both an organizational and a technical problem. On the organizational side, it involves clear ownership, shared data definitions, decision authority, and processes for prioritizing and resolving quality issues. On the technical side, it typically involves mechanisms to check, validate, and monitor data quality.
To address the technical aspects of data quality, we typically use tools like Soda, Great Expectations, and dbt. These tools allow teams to define explicit expectations about their data and validate them automatically as part of pipelines and workflows.
In this post, we'll explore how Soda Core can be used to define, run, and automate quality checks, and how those checks fit into modern workflows.
Soda Core
Soda Core is an open-source data quality framework designed to run automated checks against data sources (warehouses, databases, etc.). It is typically executed as a command-line tool and is well suited for non-interactive environments like pipelines and workflow schedulers.
At a high level, Soda Core works by connecting to the data source and running declaratively defined checks to evaluate specific metrics.
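To make that concrete, a checks file pairs a dataset with a list of declarative expectations about its metrics. A minimal sketch, where some_table and some_column are placeholder names rather than anything from this post's example:

checks for some_table:
  # Placeholder names; we'll build a real checks file below
  - row_count > 0                   # the dataset must not be empty
  - missing_count(some_column) = 0  # no missing values in this column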
Soda also offers a paid service called Soda Cloud, which is a hosted version that adds additional features. We focus on Soda Core here because it is open source, lightweight, and works well as a practical foundation for automated data quality checks.
In the next section, we will define a simple set of checks and run them using Soda Core to illustrate how this works in practice.
A small example
Soda Core works with a variety of data sources. In this example we'll use DuckDB because it requires zero infrastructure, is extremely easy to set up, and is generally awesome. All checks will remain portable to Postgres, Snowflake, Databricks, etc.
We're using uv to manage the environment, but you can use whatever tool you prefer.
Let's start by installing the dependencies we need (we'll pin the project to Python 3.11, as Soda Core still requires distutils, which was removed in Python 3.12):
uv python pin 3.11
uv init
uv add "soda-core<4.0" soda-core-duckdb duckdb
You can verify that it was installed correctly by running:
uv run soda
Create a DuckDB database and load the test data
To get some data into a DuckDB database that we can test, we'll create a small Python script. Download some meter readings from here, save the file in your current directory as meter_readings.csv, and create a file called load.py with the following contents:
import duckdb

# Create (or open) the database file and load the CSV into a table.
con = duckdb.connect("energy.duckdb")
con.execute("""
    CREATE OR REPLACE TABLE meter_readings AS
    SELECT
        meter_id,
        timestamp,
        consumption_kwh,
        now() AS created_at  -- record when the data was loaded
    FROM read_csv_auto(
        'meter_readings.csv',
        timestampformat='%Y-%m-%d %H:%M:%S'
    )
""")
con.close()
And run it:
uv run python load.py
You should now have a file called energy.duckdb in your current directory.
Connect Soda Core to the data source
Next create a file called config.yaml with the following contents:
data_source energy:
  type: duckdb
  path: energy.duckdb
  read_only: true
This defines a data source named energy. You can verify the connection by running:
uv run soda test-connection -d energy -c config.yaml
Now that we have the basics out of the way, we can start creating “checks”.
Check for missing fields
Let's create a file called checks_meter_readings.yaml (we’ll continue to use this file in all other examples) and add our first checks to verify that no fields are empty:
checks for meter_readings:
  # Expect no empty fields
  - missing_count(meter_id) = 0
  - missing_count(timestamp) = 0
  - missing_count(consumption_kwh) = 0
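By default, Soda Core treats NULL as missing. If your source uses sentinel strings instead, SodaCL lets you declare those as missing as well. A sketch (the sentinel values here are assumptions, not values from our dataset, and we won't add this to our running file):

checks for meter_readings:
  # Also count assumed sentinel strings as missing values
  - missing_count(meter_id) = 0:
      missing values: ["", "N/A", "unknown"]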
Let's run a scan with Soda and see what happens:
uv run soda scan -d energy -c config.yaml checks_meter_readings.yaml
This tells Soda Core to use the connection energy in config.yaml and scan the data using the checks in checks_meter_readings.yaml. Here's the output:
[17:47:33] Soda Core 3.5.6
[17:47:33] Using DefaultSampler
[17:47:33] Scan summary:
[17:47:33] 2/3 checks PASSED:
[17:47:33]     meter_readings in energy
[17:47:33]       missing_count(meter_id) = 0 [PASSED]
[17:47:33]       missing_count(timestamp) = 0 [PASSED]
[17:47:33] 1/3 checks FAILED:
[17:47:33]     meter_readings in energy
[17:47:33]       missing_count(consumption_kwh) = 0 [FAILED]
[17:47:33]         check_value: 1
[17:47:33] Oops! 1 failures. 0 warnings. 0 errors. 2 pass.
Soda's telling us that there's an issue with our data. A row is missing a consumption_kwh value. Good to know.
Check uniqueness
Another very useful thing to check is uniqueness. Let's make sure there's only one entry per meter and timestamp in our dataset:
checks for meter_readings:
  # ...
  # Expect no duplicate entries
  - duplicate_count(meter_id, timestamp) = 0
If you re-run the scan you'll notice that there's a duplicated row in the data.
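Counting duplicates tells you that something is wrong, but not which rows are affected. If you want the offending rows themselves, SodaCL's failed rows check with a custom query is one way to get them. A sketch (as an aside; we won't add it to our running file):

checks for meter_readings:
  # Surface the actual duplicated (meter_id, timestamp) pairs
  - failed rows:
      name: duplicated readings
      fail query: |
        SELECT meter_id, timestamp, COUNT(*) AS n
        FROM meter_readings
        GROUP BY meter_id, timestamp
        HAVING COUNT(*) > 1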
Check cardinality
You can of course also create custom checks backed by your own SQL. In this one, we ensure that there are exactly five distinct meter IDs in the data, as that is what we expect:
checks for meter_readings:
  # ...
  # Expect exactly five distinct meters
  - distinct_meters = 5:
      distinct_meters query: |
        SELECT COUNT(DISTINCT meter_id)
        FROM meter_readings
Check consumption validity
Next we'll make sure that the consumption values fall within a reasonable domain:
checks for meter_readings:
  # ...
  # Expect reasonable consumption values
  - min(consumption_kwh) >= 0
  - max(consumption_kwh) <= 10
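Note that min and max are aggregates, so a failure tells you the worst value but not how many readings are out of range. An alternative is to count invalid values directly with a validity configuration. A sketch of the same domain expressed that way (shown for comparison; not added to our running file):

checks for meter_readings:
  # Count individual out-of-range readings instead of checking aggregates
  - invalid_count(consumption_kwh) = 0:
      valid min: 0
      valid max: 10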
Check freshness
Here we'll introduce a freshness check using the created_at column. We also introduce the concept of "thresholds," which instruct Soda Core how to behave when different conditions are met:
checks for meter_readings:
  # ...
  # Ensure that the data is fresh
  - freshness(created_at):
      warn: when > 1d
      fail: when > 2d
The check will warn if there's no data newer than a day, and fail if there's no data newer than two days.
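These warn/fail thresholds are not specific to freshness; most SodaCL checks accept them, which is useful for checks that should raise a flag without immediately breaking a pipeline. For example, a sketch that tolerates a few missing values but fails past a limit (the numbers are illustrative, and this variant isn't part of our running file):

checks for meter_readings:
  # Tolerate a handful of missing readings, fail on many (illustrative limits)
  - missing_count(consumption_kwh):
      warn: when > 0
      fail: when > 10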
Check format conformance
Another useful thing to verify is that values follow specific formats. In this check, we verify that meter_id values follow the METER_### format:
checks for meter_readings:
  # ...
  # Enforce a consistent meter_id format
  - invalid_count(meter_id) = 0:
      name: valid identifier
      valid regex: ^METER_[0-9]{3}$
Check completeness
You can use row_count to see how many rows are present in the dataset and verify that a reasonable number of readings has been received. In these examples, we run checks against the entire dataset. In a real scenario, you would define expected volumes for the specific slice of data being validated.
checks for meter_readings:
  # ...
  # Expect roughly 24*5 rows per day, allowing for minor variation.
  - row_count between 118 and 122
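To scope a volume check to a specific slice rather than the whole table, SodaCL offers dataset filters. Here is a sketch that validates only one day's rows; the filter name and the SCAN_DATE variable are assumptions you would adapt to your scheduler, and we keep the unfiltered version in our running file:

filter meter_readings [daily]:
  where: CAST(timestamp AS DATE) = DATE '${SCAN_DATE}'

checks for meter_readings [daily]:
  # Same volume expectation, now per validated day
  - row_count between 118 and 122

The date is then supplied at scan time with the -v flag, e.g. uv run soda scan -d energy -c config.yaml -v SCAN_DATE=2024-01-01 checks_meter_readings.yaml.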
Final result
This is the final file with all the checks:
checks for meter_readings:
  # Expect no empty fields
  - missing_count(meter_id) = 0
  - missing_count(timestamp) = 0
  - missing_count(consumption_kwh) = 0
  # Expect no duplicate entries
  - duplicate_count(meter_id, timestamp) = 0
  # Expect exactly five distinct meters
  - distinct_meters = 5:
      distinct_meters query: |
        SELECT COUNT(DISTINCT meter_id)
        FROM meter_readings
  # Expect reasonable consumption values
  - min(consumption_kwh) >= 0
  - max(consumption_kwh) <= 10
  # Ensure that the data is fresh
  - freshness(created_at):
      warn: when > 1d
      fail: when > 2d
  # Enforce a consistent meter_id format
  - invalid_count(meter_id) = 0:
      name: valid identifier
      valid regex: ^METER_[0-9]{3}$
  # Expect roughly 24*5 rows per day, allowing for minor variation.
  - row_count between 118 and 122
And this is the output when we run them all at once:
[23:05:46] Soda Core 3.5.6
[23:05:46] Using DefaultSampler
[23:05:46] Scan summary:
[23:05:46] 6/10 checks PASSED:
[23:05:46]     meter_readings in energy
[23:05:46]       distinct_meters = 5 [PASSED]
[23:05:46]       row_count between 118 and 122 [PASSED]
[23:05:46]       missing_count(meter_id) = 0 [PASSED]
[23:05:46]       valid identifier [PASSED]
[23:05:46]       missing_count(timestamp) = 0 [PASSED]
[23:05:46]       freshness(created_at) warn when > 1d fail when > 2d [PASSED]
[23:05:46] 4/10 checks FAILED:
[23:05:46]     meter_readings in energy
[23:05:46]       duplicate_count(meter_id, timestamp) = 0 [FAILED]
[23:05:46]         check_value: 1
[23:05:46]       missing_count(consumption_kwh) = 0 [FAILED]
[23:05:46]         check_value: 1
[23:05:46]       min(consumption_kwh) >= 0 [FAILED]
[23:05:46]         check_value: -1.4
[23:05:46]       max(consumption_kwh) <= 10 [FAILED]
[23:05:46]         check_value: 1180.0
[23:05:46] Oops! 4 failures. 0 warnings. 0 errors. 6 pass.
From this output, we can see that Soda Core detected a duplicate row, a missing consumption_kwh value, a negative consumption_kwh value, and a consumption_kwh value far outside the expected range.
Wrap-up
What we’ve shown here is intentionally simple, but it is a setup that scales.
By defining version-controlled data quality checks and running them automatically and continuously, teams and organizations will have a much better time working with data. Issues surface early, close to where they are introduced, and can block downstream processing before bad data can propagate.
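As one concrete starting point for that continuous, automated execution, the scan command drops straight into any scheduler or CI system. Here is a minimal sketch of a scheduled GitHub Actions workflow, assuming the uv-based setup from this post (the schedule and file names are illustrative):

name: data-quality
on:
  schedule:
    - cron: "0 6 * * *"   # illustrative: run every morning at 06:00 UTC
jobs:
  soda-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v5
      - run: uv sync
      # soda scan exits non-zero when checks fail, which fails the job
      - run: uv run soda scan -d energy -c config.yaml checks_meter_readings.yaml

In a real pipeline you would of course point the data source at your warehouse rather than a local DuckDB file, but the shape stays the same: version-controlled checks, run on a schedule, failing loudly when the data doesn't meet expectations.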