Data quality is critical in any modern energy organization. Poor data quality directly correlates with poor decisions (garbage in, garbage out, as they say), a problem that becomes even more apparent as AI adoption increases and automated decisions are made continuously and at scale.
Data quality is both an organizational and a technical problem. On the organizational side, it involves clear ownership, shared data definitions, decision authority, and processes for prioritizing and resolving quality issues. On the technical side, it typically involves mechanisms to check, validate, and monitor data quality.
To address the technical aspects of data quality, we typically use tools like Soda, Great Expectations, and dbt. These tools allow teams to define explicit expectations about their data and validate them automatically as part of pipelines and workflows.
In this post, we'll explore how Soda Core can be used to define, run, and automate quality checks, and how those checks fit into modern workflows.
Soda Core
Soda Core is an open-source data quality framework designed to run automated checks against data sources (warehouses, databases, etc.). It is typically executed as a command-line tool and is well suited for non-interactive environments like pipelines and workflow schedulers.
At a high level, Soda Core works by connecting to the data source and running declaratively defined checks to evaluate specific metrics.
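To make that concrete, a checks file pairs a dataset with a list of declarative expectations about its metrics. A minimal sketch, where some_table and some_column are placeholder names rather than anything from this post's example:

checks for some_table:
  # Placeholder names; we'll build a real checks file below
  - row_count > 0                   # the dataset must not be empty
  - missing_count(some_column) = 0  # no missing values in this column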
Soda also offers a paid service called Soda Cloud, which is a hosted version that adds additional features. We focus on Soda Core here because it is open source, lightweight, and works well as a practical foundation for automated data quality checks.
In the next section, we will define a simple set of checks and run them using Soda Core to illustrate how this works in practice.
A small example
Soda Core works with a variety of data sources. In this example we'll use DuckDB because it requires zero infrastructure, is extremely easy to set up, and is generally awesome. All checks will remain portable to Postgres, Snowflake, Databricks, etc.
We're using uv to manage the environment, but you can use whatever tool you prefer.
Let's start by installing the dependencies we need (we'll pin the project to Python 3.11, as Soda Core still requires distutils, which was removed in Python 3.12):
uv python pin 3.11
uv init
uv add "soda-core<4.0" soda-core-duckdb duckdb
You can verify that it was installed correctly by running:
uv run soda
Create a DuckDB database and load the test data
To get some data into a DuckDB database that we can test, we'll create a small Python script. Download some meter readings from here, save the file in your current directory as meter_readings.csv, and create a file called load.py with the following contents:
import duckdb

# Create (or open) the database file and load the CSV into a table.
con = duckdb.connect("energy.duckdb")
con.execute("""
    CREATE OR REPLACE TABLE meter_readings AS
    SELECT
        meter_id,
        timestamp,
        consumption_kwh,
        now() AS created_at  -- record when the data was loaded
    FROM read_csv_auto(
        'meter_readings.csv',
        timestampformat='%Y-%m-%d %H:%M:%S'
    )
""")
con.close()
And run it:
uv run python load.py
You should now have a file called energy.duckdb in your current directory.
Connect Soda Core to the data source
Next create a file called config.yaml with the following contents:
data_source energy:
  type: duckdb
  path: energy.duckdb
  read_only: true
This defines a data source named energy. You can verify the connection by running:
uv run soda test-connection -d energy -c config.yaml
Now that we have the basics out of the way, we can start creating “checks”.
Check for missing fields
Let's create a file called checks_meter_readings.yaml (we’ll continue to use this file in all other examples) and add our first checks to verify that no fields are empty:
checks for meter_readings:
  # Expect no empty fields
  - missing_count(meter_id) = 0
  - missing_count(timestamp) = 0
  - missing_count(consumption_kwh) = 0
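By default, Soda Core treats NULL as missing. If your source uses sentinel strings instead, SodaCL lets you declare those as missing as well. A sketch (the sentinel values here are assumptions, not values from our dataset, and we won't add this to our running file):

checks for meter_readings:
  # Also count assumed sentinel strings as missing values
  - missing_count(meter_id) = 0:
      missing values: ["", "N/A", "unknown"]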
Let's run a scan with Soda and see what happens:
uv run soda scan -d energy -c config.yaml checks_meter_readings.yaml
This tells Soda Core to use the connection energy in config.yaml and scan the data using the checks in checks_meter_readings.yaml. Here's the output:
[17:47:33] Soda Core 3.5.6
[17:47:33] Using DefaultSampler
[17:47:33] Scan summary:
[17:47:33] 2/3 checks PASSED:
[17:47:33]     meter_readings in energy
[17:47:33]       missing_count(meter_id) = 0 [PASSED]
[17:47:33]       missing_count(timestamp) = 0 [PASSED]
[17:47:33] 1/3 checks FAILED:
[17:47:33]     meter_readings in energy
[17:47:33]       missing_count(consumption_kwh) = 0 [FAILED]
[17:47:33]         check_value: 1
[17:47:33] Oops! 1 failures. 0 warnings. 0 errors. 2 pass.
Soda's telling us that there's an issue with our data. A row is missing a consumption_kwh value. Good to know.
Check uniqueness
Another very useful thing to check is uniqueness. Let's make sure there's only one entry per meter and timestamp in our dataset:
checks for meter_readings:
  # ...
  # Expect no duplicate entries
  - duplicate_count(meter_id, timestamp) = 0
If you re-run the scan you'll notice that there's a duplicated row in the data.
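Counting duplicates tells you that something is wrong, but not which rows are affected. If you want the offending rows themselves, SodaCL's failed rows check with a custom query is one way to get them. A sketch (as an aside; we won't add it to our running file):

checks for meter_readings:
  # Surface the actual duplicated (meter_id, timestamp) pairs
  - failed rows:
      name: duplicated readings
      fail query: |
        SELECT meter_id, timestamp, COUNT(*) AS n
        FROM meter_readings
        GROUP BY meter_id, timestamp
        HAVING COUNT(*) > 1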
Check cardinality
You can of course also create custom checks backed by your own SQL. In this one, we ensure that there are exactly five distinct meter IDs in the data, as that is what we expect:
checks for meter_readings:
  # ...
  # Expect exactly five distinct meters
  - distinct_meters = 5:
      distinct_meters query: |
        SELECT COUNT(DISTINCT meter_id)
        FROM meter_readings
Check consumption validity
Next we'll make sure that the consumption values fall within a reasonable domain:
checks for meter_readings:
  # ...
  # Expect reasonable consumption values
  - min(consumption_kwh) >= 0
  - max(consumption_kwh) <= 10
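Note that min and max are aggregates, so a failure tells you the worst value but not how many readings are out of range. An alternative is to count invalid values directly with a validity configuration. A sketch of the same domain expressed that way (shown for comparison; not added to our running file):

checks for meter_readings:
  # Count individual out-of-range readings instead of checking aggregates
  - invalid_count(consumption_kwh) = 0:
      valid min: 0
      valid max: 10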
Check freshness
Here we'll introduce a freshness check using the created_at column. We also introduce the concept of "thresholds," which instruct Soda Core how to behave when different conditions are met:
checks for meter_readings:
  # ...
  # Ensure that the data is fresh
  - freshness(created_at):
      warn: when > 1d
      fail: when > 2d
The check will warn if there's no data newer than a day, and fail if there's no data newer than two days.
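These warn/fail thresholds are not specific to freshness; most SodaCL checks accept them, which is useful for checks that should raise a flag without immediately breaking a pipeline. For example, a sketch that tolerates a few missing values but fails past a limit (the numbers are illustrative, and this variant isn't part of our running file):

checks for meter_readings:
  # Tolerate a handful of missing readings, fail on many (illustrative limits)
  - missing_count(consumption_kwh):
      warn: when > 0
      fail: when > 10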
Check format conformance
Another useful thing to verify is that values follow specific formats. In this check, we verify that meter_id values follow the METER_### format:
checks for meter_readings:
  # ...
  # Enforce a consistent meter_id format
  - invalid_count(meter_id) = 0:
      name: valid identifier
      valid regex: ^METER_[0-9]{3}$
Check completeness
You can use row_count to see how many rows are present in the dataset and verify that a reasonable number of readings has been received. In these examples, we run checks against the entire dataset. In a real scenario, you would define expected volumes for the specific slice of data being validated.
checks for meter_readings:
  # ...
  # Expect roughly 24*5 rows per day, allowing for minor variation.
  - row_count between 118 and 122
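To scope a volume check to a specific slice rather than the whole table, SodaCL offers dataset filters. Here is a sketch that validates only one day's rows; the filter name and the SCAN_DATE variable are assumptions you would adapt to your scheduler, and we keep the unfiltered version in our running file:

filter meter_readings [daily]:
  where: CAST(timestamp AS DATE) = DATE '${SCAN_DATE}'

checks for meter_readings [daily]:
  # Same volume expectation, now per validated day
  - row_count between 118 and 122

The date is then supplied at scan time with the -v flag, e.g. uv run soda scan -d energy -c config.yaml -v SCAN_DATE=2024-01-01 checks_meter_readings.yaml.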
Final result
This is the final file with all the checks:
checks for meter_readings:
  # Expect no empty fields
  - missing_count(meter_id) = 0
  - missing_count(timestamp) = 0
  - missing_count(consumption_kwh) = 0
  # Expect no duplicate entries
  - duplicate_count(meter_id, timestamp) = 0
  # Expect exactly five distinct meters
  - distinct_meters = 5:
      distinct_meters query: |
        SELECT COUNT(DISTINCT meter_id)
        FROM meter_readings
  # Expect reasonable consumption values
  - min(consumption_kwh) >= 0
  - max(consumption_kwh) <= 10
  # Ensure that the data is fresh
  - freshness(created_at):
      warn: when > 1d
      fail: when > 2d
  # Enforce a consistent meter_id format
  - invalid_count(meter_id) = 0:
      name: valid identifier
      valid regex: ^METER_[0-9]{3}$
  # Expect roughly 24*5 rows per day, allowing for minor variation.
  - row_count between 118 and 122
And this is the output when we run them all at once:
[23:05:46] Soda Core 3.5.6
[23:05:46] Using DefaultSampler
[23:05:46] Scan summary:
[23:05:46] 6/10 checks PASSED:
[23:05:46]     meter_readings in energy
[23:05:46]       distinct_meters = 5 [PASSED]
[23:05:46]       row_count between 118 and 122 [PASSED]
[23:05:46]       missing_count(meter_id) = 0 [PASSED]
[23:05:46]       valid identifier [PASSED]
[23:05:46]       missing_count(timestamp) = 0 [PASSED]
[23:05:46]       freshness(created_at) warn when > 1d fail when > 2d [PASSED]
[23:05:46] 4/10 checks FAILED:
[23:05:46]     meter_readings in energy
[23:05:46]       duplicate_count(meter_id, timestamp) = 0 [FAILED]
[23:05:46]         check_value: 1
[23:05:46]       missing_count(consumption_kwh) = 0 [FAILED]
[23:05:46]         check_value: 1
[23:05:46]       min(consumption_kwh) >= 0 [FAILED]
[23:05:46]         check_value: -1.4
[23:05:46]       max(consumption_kwh) <= 10 [FAILED]
[23:05:46]         check_value: 1180.0
[23:05:46] Oops! 4 failures. 0 warnings. 0 errors. 6 pass.
From this output, we can see that Soda Core detected a duplicate row, a missing consumption_kwh value, a negative consumption_kwh value, and a consumption_kwh value far outside the expected range.
Wrap-up
What we’ve shown here is intentionally simple, but it is a setup that scales.
By defining version-controlled data quality checks and running them automatically and continuously, teams and organizations will have a much better time working with data. Issues surface early, close to where they are introduced, and can block downstream processing before bad data can propagate.
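As one concrete starting point for that continuous, automated execution, the scan command drops straight into any scheduler or CI system. Here is a minimal sketch of a scheduled GitHub Actions workflow, assuming the uv-based setup from this post (the schedule and file names are illustrative):

name: data-quality
on:
  schedule:
    - cron: "0 6 * * *"   # illustrative: run every morning at 06:00 UTC
jobs:
  soda-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v5
      - run: uv sync
      # soda scan exits non-zero when checks fail, which fails the job
      - run: uv run soda scan -d energy -c config.yaml checks_meter_readings.yaml

In a real pipeline you would of course point the data source at your warehouse rather than a local DuckDB file, but the shape stays the same: version-controlled checks, run on a schedule, failing loudly when the data doesn't meet expectations.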