Data Science & MLOps: Kedro for Reliable ML Workflows
So, you’ve built a great machine learning (ML) model. It works in your notebook, your metrics look solid, and you’re feeling pretty good. But what happens when you try to take that model into production? Suddenly, it’s like crossing a bridge with missing planks—everything starts falling apart.
This isn’t just you. ML workflows are tricky, and they come with challenges that are not typically encountered in traditional software engineering. Today, I’m going to break down these challenges and introduce you to Kedro, a tool that can help you streamline the organization of your ML code, bringing you one step closer to production-ready ML systems.
1. Challenges of ML workflows
Data science problems rarely start with clear boundaries. You’re often handed an ill-defined question, and as you explore the data, new patterns emerge, new ideas surface, and suddenly, the problem itself shifts. Before you know it, you’re tweaking models, changing target metrics, and rewriting pipelines. This is because data science is highly experimental by nature: it’s less about writing perfect, modular code and more about prototyping fast and seeing what works. But this comes with a cost: pipelines get messy, modularity goes out the window, and suddenly maintaining or scaling the project becomes painful.
On top of this experimental nature, you need to keep track of datasets, model parameters, hyperparameters, and experiment results—all of which can drastically impact outcomes. Update a hyperparameter? Your model’s performance shifts entirely. Git can handle your code, sure, but it wasn’t built to version datasets, metrics, or experiment states.
The result? ML workflows quickly become a tangle of notebooks, scripts, and results, with no easy way to trace what changed, when it changed, or why things suddenly stopped working. This is where MLOps tools like Kedro and MLflow step in. They’re built to bring structure to the chaos—helping you build reproducible workflows and making it easier to experiment, iterate, and scale without losing your mind.
In this post, I’ll focus on Kedro and how it simplifies building modular pipelines, managing data, and centralizing parameters.
2. Kedro: Bringing Structure to ML Workflows
Kedro is an open-source Python framework that helps you build clean, modular, and reproducible ML workflows. Think of it as applying software engineering best practices to your data science projects.
Kedro allows you to decouple three critical components of your ML workflow:
- The Code: Modular and reusable pipelines that cleanly separate each task.
- The Data Sources: Managed through a centralized Data Catalog, which encodes “where and how to read/write things”.
- The Parameters: Defined in dedicated parameter files, keeping your hyperparameters clean, versioned, and easy to tweak (even at runtime, as I’ll show you later!).
This separation is foundational to Kedro. By keeping code, data sources, and parameters independent, your workflow becomes flexible, scalable, and production-ready. Swap datasets, tweak hyperparameters, or move to production—no pipeline code changes needed.
Plus, this structure adds clarity to ML projects, making pipelines easier to understand, maintain, and debug. New team members can be onboarded without wading through scattered scripts or notebooks.
Let’s break down how it all works and why it matters.
2.1 The Code
Kedro encourages you to break your workflow into small, focused functions—each doing one thing well. These functions can then be organized as nodes in a directed acyclic graph (DAG), or what we commonly call a “pipeline.” Building the pipeline is simple: you list your nodes in a pipeline object, and Kedro takes care of the rest. It automagically figures out the dependencies between nodes, ensuring everything runs in the right order.
Let’s look at an example. Say you’re building a pipeline to clean some data. Instead of cramming all the cleaning steps into one massive function, let’s define each step as its own focused function:
import pandas as pd


def fill_na_with_zeros(data: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values with zeros."""
    return data.fillna(0)


def clip_outliers(data: pd.DataFrame, lower: float, upper: float) -> pd.DataFrame:
    """Clip outliers in the dataset based on given thresholds."""
    return data.clip(lower=lower, upper=upper)


def normalize_data(data: pd.DataFrame) -> pd.DataFrame:
    """Normalize the data between 0 and 1."""
    return (data - data.min()) / (data.max() - data.min())
Here, each function (or node) does one thing and one thing only—a classic software engineering best practice. Now, let’s define a pipeline that connects these tasks:
from kedro.pipeline import Pipeline, node


def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(fill_na_with_zeros, inputs="raw_data", outputs="data_no_na"),
            node(
                clip_outliers,
                inputs=["data_no_na", "params:clip.lower", "params:clip.upper"],
                outputs="data_clipped",
            ),
            node(normalize_data, inputs="data_clipped", outputs="cleaned_data"),
        ]
    )
Let’s break this down step by step to understand what’s happening in the pipeline creation and how Kedro organizes your workflow!
Each function—fill_na_with_zeros, clip_outliers, and normalize_data—is wrapped into a node. A node is a wrapper that tells Kedro how the function connects to the pipeline:
- Inputs: What data or parameters the function needs to run.
- Outputs: What the function produces as a result.
For example:
node(fill_na_with_zeros, inputs="raw_data", outputs="data_no_na")
This tells Kedro that:
- The function fill_na_with_zeros takes raw_data as input.
- The output of this function is named data_no_na, which will be passed to the next node in the pipeline.
The output of one node becomes the input for the next.
- The fill_na_with_zeros node produces data_no_na.
- This becomes the input to the clip_outliers node, which also takes parameters (params:clip.lower and params:clip.upper) from the pipeline’s parameter store.
- Finally, the normalize_data node uses the output of clip_outliers (data_clipped) and produces the final output, cleaned_data.
Because the inputs and outputs are well defined, Kedro can, under the hood, build a DAG to determine the order of execution: it resolves all dependencies, ensuring nodes run in the correct sequence.
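To make this concrete, here is a minimal sketch of running the pipeline in-process with Kedro’s SequentialRunner and a hand-built in-memory catalog. The toy DataFrame and parameter values are my own illustration, and it assumes a recent Kedro version (roughly 0.19+), where MemoryDataset is exported from kedro.io; in a real project you would normally just call kedro run.

import pandas as pd
from kedro.io import DataCatalog, MemoryDataset
from kedro.runner import SequentialRunner

# Tiny in-memory catalog standing in for the real Data Catalog
catalog = DataCatalog(
    {
        "raw_data": MemoryDataset(pd.DataFrame({"x": [1.0, None, 250.0]})),
        "params:clip.lower": MemoryDataset(0.0),
        "params:clip.upper": MemoryDataset(100.0),
    }
)

# Kedro resolves the execution order from the declared inputs/outputs;
# terminal outputs not registered in the catalog (here, cleaned_data) are returned
outputs = SequentialRunner().run(create_pipeline(), catalog)
print(outputs["cleaned_data"])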
Now that we know the basics of how to write an ML workflow with Kedro, we need to understand how and where to define (1) the data inputs/outputs (raw_data, data_no_na, data_clipped, cleaned_data) and (2) the parameters (params:clip.lower, params:clip.upper). This is where the Data Catalog and the parameter files come into play!
2.2 The Data Catalog
The Kedro Data Catalog is a centralized configuration file (catalog.yaml) that keeps track of all your data objects. Kedro uses the term “dataset” to mean anything that can live in memory and/or on disk: traditional tables or CSV files, a list of strings, an image, a JSON object, or even a trained model. The Data Catalog defines where these objects live, how to load them, and how to save them, no matter their type or format.
This flexibility means you can manage everything in your Python code—raw inputs, intermediate results, and final outputs—without hard-coding a single file path (as we saw in the previous section).
Here’s an example catalog.yaml file:
raw_data:
    type: pandas.CSVDataset
    filepath: data/01_raw/raw_data.csv

clean_data:
    type: pandas.ParquetDataset
    filepath: data/02_clean/clean_data.parquet
Let’s break down what’s happening in the example catalog.yaml file:
- raw_data:
  - Type: pandas.CSVDataset tells Kedro that this dataset should be loaded as a Pandas DataFrame from a CSV file (if raw_data is a node input) and should be saved as a CSV file (if raw_data is a node output).
  - Filepath: The file is located at data/01_raw/raw_data.csv (relative to the root of your git repository).
- clean_data:
  - Type: pandas.ParquetDataset tells Kedro that this dataset should be loaded as a Pandas DataFrame from a Parquet file (if clean_data is a node input) and should be saved as a Parquet file (if clean_data is a node output).
  - Filepath: The file is located at data/02_clean/clean_data.parquet (relative to the root of your git repository).
When Kedro runs your pipeline, it handles these datasets automatically based on their configuration. All you need to do in your code is refer to the dataset by its name—Kedro takes care of loading and saving it.
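If you want to inspect a dataset outside a pipeline run, say in a notebook, you can load it by name through the same catalog. Here is a minimal sketch, assuming the YAML above lives in conf/base/catalog.yaml and that kedro-datasets is installed (inside a Kedro project, kedro jupyter or kedro ipython already expose a ready-made catalog variable for you):

import yaml
from kedro.io import DataCatalog

# Build a DataCatalog object from the YAML shown above
with open("conf/base/catalog.yaml") as f:
    conf_catalog = yaml.safe_load(f)
catalog = DataCatalog.from_config(conf_catalog)

df = catalog.load("raw_data")    # reads data/01_raw/raw_data.csv into a DataFrame
catalog.save("clean_data", df)   # writes data/02_clean/clean_data.parquet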
Now, imagine you need to move your raw data from a local disk to an S3 bucket in production. Instead of rewriting your code, you simply update the catalog.yaml file:
raw_data:
    type: pandas.CSVDataset
    filepath: s3://my-bucket/raw_data.csv
Better yet, Kedro’s configuration environments let you define separate settings for different scenarios (e.g., local vs. production):
- conf/base/catalog.yaml: Default configurations (e.g., local file paths).
- conf/prod/catalog.yaml: Production configurations (e.g., S3 file paths).
To run your pipeline with the production environment, just use the CLI: kedro run --env prod.
The Data Catalog is far more than a file manager. It supports a wide range of connectors through the kedro-datasets library. You can connect to APIs, SQL databases, cloud storage, and more. And if your project needs something unique, you can easily create a custom dataset connector (see the sketch after this list). The beauty of the Data Catalog is that we now have:
- No hard-coded paths: Your code stays clean and focused on logic.
- Seamless backend switching: Move between local files, S3, databases, or APIs with minimal effort.
- Flexibility: Kedro handles a variety of data formats and sources, so you’re not limited to simple files.
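To give a flavour of what a custom connector involves, here is a minimal sketch of a dataset that reads and writes a plain text file. The class itself is hypothetical, and it assumes Kedro 0.19+, where the base class is kedro.io.AbstractDataset:

from pathlib import Path

from kedro.io import AbstractDataset


class TextFileDataset(AbstractDataset[str, str]):
    """Toy custom dataset: load/save a plain text file as a string."""

    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def _load(self) -> str:
        return self._filepath.read_text()

    def _save(self, data: str) -> None:
        self._filepath.parent.mkdir(parents=True, exist_ok=True)
        self._filepath.write_text(data)

    def _describe(self) -> dict:
        return {"filepath": str(self._filepath)}

You would then reference it in catalog.yaml like any built-in type, e.g. type: my_project.datasets.TextFileDataset (a hypothetical module path).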
2.3 The Parameters
Parameters in Kedro are stored in simple YAML files (parameters.yaml) that keep your pipeline settings, hyperparameters, and thresholds cleanly separated from your code.
Here’s an example parameters.yaml file:
clip:
    lower: 0
    upper: 100
Let’s revisit the clip_outliers node from the earlier example. Notice that it uses params:clip.lower and params:clip.upper as inputs. Kedro pulls these values from parameters.yaml at runtime, so you don’t need to hard-code them into your function.
node(
    clip_outliers,
    inputs=["data_no_na", "params:clip.lower", "params:clip.upper"],
    outputs="data_clipped",
)
This flexibility allows you to update parameters without touching the pipeline code. Need to experiment with a new threshold? Just update parameters.yaml or override the values from the command line, e.g., kedro run --params "clip.lower=10,clip.upper=90".
Just like the Data Catalog, you can also use Kedro’s configuration environments to define parameters for different scenarios (e.g., local vs. production):
- conf/base/parameters.yaml: Default parameters for prototyping (e.g., fewer training epochs for fast iteration).
- conf/prod/parameters.yaml: Production parameters for accuracy (e.g., more training epochs for robust models).
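As an illustration (the train.epochs key is made up for this example, not part of the pipeline above), the two files could look like this, with the prod file re-declaring only the settings it wants to change:

# conf/base/parameters.yaml
train:
    epochs: 5

# conf/prod/parameters.yaml
train:
    epochs: 100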
When you want to use the production environment, simply run kedro run --env prod, and Kedro will automatically override the base parameters with the production environment-specific ones! This makes it easy to manage settings for different stages of your workflow without duplicating code or configuration. This centralized, flexible approach to parameter management brings several key benefits:
- Clean Separation: Parameters are centralized, making it easy to see all the tunable settings for your pipeline in one place. This centralization also helps you minimize the risk of errors from mismatched settings.
- Version Control: Parameter files can be versioned alongside your code, ensuring reproducibility.
- Experimentation: Easily test different parameter values without changing the underlying pipeline logic.
- Environment Flexibility: Adapt parameters for local, testing, or production environments with minimal effort.
3. Conclusion
Machine learning workflows are exciting but messy. They’re iterative, experimental, and inherently complex—you’re juggling raw data, intermediate outputs, models, hyperparameters, and more. Without a solid structure, it’s easy for things to turn into a chaotic tangle of notebooks and scripts, making collaboration, scaling, and reproducibility much harder than they need to be.
This is exactly where Kedro can make your life easier. By bringing structure and best practices to ML projects, Kedro helps you organize your work into clean, modular pipelines, manage your data with ease, and keep your parameters flexible and centralized. Over time, it has helped me and my teams build ML projects faster, more robustly, and with far fewer headaches.
That said, Kedro is a big framework, and I’ve only scratched the surface here. Features like experiment logging, pipeline visualization, and powerful integrations make Kedro even more useful, but there’s just no way to cover everything in one post. If you’re intrigued by what you’ve seen so far, I highly recommend diving into Kedro’s documentation to explore its full potential.
Now, let’s address a concern I often hear: “It’s another framework to learn, and I just don’t have the time.”
I get it. But here’s the thing: whether you know it or not, you’re already building your own framework every time you organize functions, pass data between them, and try to keep things consistent. The problem is, most of us don’t do this consistently or in a way that’s easy to maintain or scale. Kedro gives you that structure out of the box. Writing nodes? That’s just writing Python functions. Building pipelines? You’re already doing that too, just less systematically.
The beauty of Kedro is its flexibility. It’s pure Python, so you can start small. Use Kedro to organize a single part of your project—maybe that messy preprocessing workflow—and see how it fits. You don’t have to commit to the whole framework upfront. Over time, you’ll see how Kedro’s structure and tools make your work cleaner, faster, and more enjoyable.
For me, Kedro has been a game-changer. It’s helped me stay organized, reduce complexity, and focus on what matters most: building great machine learning systems. If you’re ready to give it a try, start simple. Once you see the difference, I think you’ll wonder how you managed without it.
Happy coding! 🚀
Leandro Salemi, Data Scientist & ML Engineer at Helicon Technologies.