Data Science & MLOps: Kedro for Reliable ML Workflows
So, you’ve built a great machine learning (ML) model. It works in your notebook, your metrics look solid, and you’re feeling pretty good. But what happens when you try to take that model into production? Suddenly, it’s like crossing a bridge with missing planks—everything starts falling apart.
This isn’t just you. ML workflows are tricky, and they come with challenges that are not typically encountered in traditional software engineering. Today, I’m going to break down these challenges and introduce you to Kedro, a tool that can help you streamline the organization of your ML code, bringing you one step closer to production-ready ML systems.
1. Challenges of ML workflows
Data science problems rarely start with clear boundaries. You’re often handed an ill-defined question, and as you explore the data, new patterns emerge, new ideas surface, and suddenly, the problem itself shifts. Before you know it, you’re tweaking models, changing target metrics, and rewriting pipelines. This is because data science is highly experimental by nature: it’s less about writing perfect, modular code and more about prototyping fast and seeing what works. But this comes with a cost: pipelines get messy, modularity goes out the window, and suddenly maintaining or scaling the project becomes painful.
On top of this experimental nature, you need to keep track of datasets, model parameters, hyperparameters, and experiment results—all of which can drastically impact outcomes. Update a hyperparameter? Your model’s performance shifts entirely. Git can handle your code, sure, but it wasn’t built to version datasets, metrics, or experiment states.
The result? ML workflows quickly become a tangle of notebooks, scripts, and results, with no easy way to trace what changed, when it changed, or why things suddenly stopped working. This is where MLOps tools like Kedro and MLflow step in. They’re built to bring structure to the chaos—helping you build reproducible workflows and making it easier to experiment, iterate, and scale without losing your mind.
In this post, I’ll focus on Kedro and how it simplifies building modular pipelines, managing data, and centralizing parameters.
2. Kedro: Bringing Structure to ML Workflows
Kedro is an open-source Python framework that helps you build clean, modular, and reproducible ML workflows. Think of it as applying software engineering best practices to your data science projects.
Kedro allows you to decouple three critical components of your ML workflow:
- The Code: Modular and reusable pipelines that cleanly separate each task.
- The Data Sources: Managed through a centralized Data Catalog, which encodes “where and how to read/write things”.
- The Parameters: Defined in dedicated parameter files, keeping your hyperparameters clean, versioned, and easy to tweak (even at runtime, as I’ll show you later!).
This separation is foundational to Kedro. By keeping code, data sources, and parameters independent, your workflow becomes flexible, scalable, and production-ready. Swap datasets, tweak hyperparameters, or move to production—no pipeline code changes needed.
Plus, this structure adds clarity to ML projects, making pipelines easier to understand, maintain, and debug. New team members can be onboarded without wading through scattered scripts or notebooks.
Let’s break down how it all works and why it matters.
2.1 The Code
Kedro encourages you to break your workflow into small, focused functions—each doing one thing well. These functions can then be organized as nodes in a directed acyclic graph (DAG), or what we commonly call a “pipeline.” Building the pipeline is simple: you list your nodes in a pipeline object, and Kedro takes care of the rest. It automagically figures out the dependencies between nodes, ensuring everything runs in the right order.
Let’s look at an example. Say you’re building a pipeline to clean some data. Instead of cramming all the cleaning steps into one massive function, let’s define each step as its own focused function:
import pandas as pd


def fill_na_with_zeros(data: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values with zeros."""
    return data.fillna(0)


def clip_outliers(data: pd.DataFrame, lower: float, upper: float) -> pd.DataFrame:
    """Clip outliers in the dataset based on given thresholds."""
    return data.clip(lower=lower, upper=upper)


def normalize_data(data: pd.DataFrame) -> pd.DataFrame:
    """Normalize the data between 0 and 1."""
    return (data - data.min()) / (data.max() - data.min())
Here, each function (or node) does one thing and one thing only—a classic software engineering best practice. Now, let’s define a pipeline that connects these tasks:
from kedro.pipeline import Pipeline, node


def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(fill_na_with_zeros, inputs="raw_data", outputs="data_no_na"),
            node(
                clip_outliers,
                inputs=["data_no_na", "params:clip.lower", "params:clip.upper"],
                outputs="data_clipped",
            ),
            node(normalize_data, inputs="data_clipped", outputs="cleaned_data"),
        ]
    )
Let’s break this down step by step to understand what’s happening in the pipeline creation and how Kedro organizes your workflow!
Each function—fill_na_with_zeros, clip_outliers, and normalize_data—is wrapped into a node. A node is a wrapper that tells Kedro how the function connects to the pipeline:
- Inputs: What data or parameters the function needs to run.
- Outputs: What the function produces as a result.
For example:
node(fill_na_with_zeros, inputs="raw_data", outputs="data_no_na")
This tells Kedro that:
- The function fill_na_with_zeros takes raw_data as input.
- The output of this function is named data_no_na, which will be passed to the next node in the pipeline.
The output of one node becomes the input for the next.
- The fill_na_with_zeros node produces data_no_na.
- This becomes the input to the clip_outliers node, which also takes parameters (params:clip.lower and params:clip.upper) from the pipeline’s parameter store.
- Finally, the normalize_data node uses the output of clip_outliers (data_clipped) and produces the final output, cleaned_data.
Because the inputs and outputs are well defined, Kedro can, under the hood, build a DAG to determine the order of execution: it resolves all dependencies, ensuring nodes run in the correct sequence.
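To make this concrete, here is a minimal sketch of running the pipeline in-process with Kedro’s SequentialRunner and a hand-built in-memory catalog. The toy DataFrame and parameter values are my own illustration, and it assumes a recent Kedro version (roughly 0.19+), where MemoryDataset is exported from kedro.io; in a real project you would normally just call kedro run.

import pandas as pd
from kedro.io import DataCatalog, MemoryDataset
from kedro.runner import SequentialRunner

# Tiny in-memory catalog standing in for the real Data Catalog
catalog = DataCatalog(
    {
        "raw_data": MemoryDataset(pd.DataFrame({"x": [1.0, None, 250.0]})),
        "params:clip.lower": MemoryDataset(0.0),
        "params:clip.upper": MemoryDataset(100.0),
    }
)

# Kedro resolves the execution order from the declared inputs/outputs;
# terminal outputs not registered in the catalog (here, cleaned_data) are returned
outputs = SequentialRunner().run(create_pipeline(), catalog)
print(outputs["cleaned_data"])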
Now that we know the basics of how to write an ML workflow with Kedro, we need to understand how and where to define (1) the data inputs/outputs (raw_data, data_no_na, data_clipped, cleaned_data) and (2) the parameters (params:clip.lower, params:clip.upper). This is where the Data Catalog and the parameter files come into play!
2.2 The Data Catalog
The Kedro Data Catalog is a centralized configuration file (catalog.yaml) that keeps track of all your data objects. Kedro uses the term “dataset” to mean anything that can live in memory and/or on disk: traditional tables or CSV files, a list of strings, an image, a JSON object, or even a trained model. The Data Catalog defines where these objects live, how to load them, and how to save them, no matter their type or format.
This flexibility means you can manage everything in your Python code—raw inputs, intermediate results, and final outputs—without hard-coding a single file path (as we saw in the previous section).
Here’s an example catalog.yaml file:
raw_data:
    type: pandas.CSVDataset
    filepath: data/01_raw/raw_data.csv

clean_data:
    type: pandas.ParquetDataset
    filepath: data/02_clean/clean_data.parquet
Let’s break down what’s happening in the example catalog.yaml file:
- raw_data:
  - Type: pandas.CSVDataset tells Kedro that this dataset should be loaded as a Pandas DataFrame from a CSV file (if raw_data is a node input) and should be saved as a CSV file (if raw_data is a node output).
  - Filepath: The file is located at data/01_raw/raw_data.csv (relative to the root of your git repository).
- clean_data:
  - Type: pandas.ParquetDataset tells Kedro that this dataset should be loaded as a Pandas DataFrame from a Parquet file (if clean_data is a node input) and should be saved as a Parquet file (if clean_data is a node output).
  - Filepath: The file is located at data/02_clean/clean_data.parquet (relative to the root of your git repository).
When Kedro runs your pipeline, it handles these datasets automatically based on their configuration. All you need to do in your code is refer to the dataset by its name—Kedro takes care of loading and saving it.
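If you want to inspect a dataset outside a pipeline run, say in a notebook, you can load it by name through the same catalog. Here is a minimal sketch, assuming the YAML above lives in conf/base/catalog.yaml and that kedro-datasets is installed (inside a Kedro project, kedro jupyter or kedro ipython already expose a ready-made catalog variable for you):

import yaml
from kedro.io import DataCatalog

# Build a DataCatalog object from the YAML shown above
with open("conf/base/catalog.yaml") as f:
    conf_catalog = yaml.safe_load(f)
catalog = DataCatalog.from_config(conf_catalog)

df = catalog.load("raw_data")    # reads data/01_raw/raw_data.csv into a DataFrame
catalog.save("clean_data", df)   # writes data/02_clean/clean_data.parquet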
Now, imagine you need to move your raw data from a local disk to an S3 bucket in production. Instead of rewriting your code, you simply update the catalog.yaml file:
raw_data:
    type: pandas.CSVDataset
    filepath: s3://my-bucket/raw_data.csv
Better yet, Kedro’s configuration environments let you define separate settings for different scenarios (e.g., local vs. production):
- conf/base/catalog.yaml: Default configurations (e.g., local file paths).
- conf/prod/catalog.yaml: Production configurations (e.g., S3 file paths).
To run your pipeline with the production environment, just use the CLI: kedro run --env prod.
The Data Catalog is far more than a file manager. It supports a wide range of connectors through the kedro-datasets library. You can connect to APIs, SQL databases, cloud storage, and more. And if your project needs something unique, you can easily create a custom dataset connector (see the sketch after this list). The beauty of the Data Catalog is that we now have:
- No hard-coded paths: Your code stays clean and focused on logic.
- Seamless backend switching: Move between local files, S3, databases, or APIs with minimal effort.
- Flexibility: Kedro handles a variety of data formats and sources, so you’re not limited to simple files.
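To give a flavour of what a custom connector involves, here is a minimal sketch of a dataset that reads and writes a plain text file. The class itself is hypothetical, and it assumes Kedro 0.19+, where the base class is kedro.io.AbstractDataset:

from pathlib import Path

from kedro.io import AbstractDataset


class TextFileDataset(AbstractDataset[str, str]):
    """Toy custom dataset: load/save a plain text file as a string."""

    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def _load(self) -> str:
        return self._filepath.read_text()

    def _save(self, data: str) -> None:
        self._filepath.parent.mkdir(parents=True, exist_ok=True)
        self._filepath.write_text(data)

    def _describe(self) -> dict:
        return {"filepath": str(self._filepath)}

You would then reference it in catalog.yaml like any built-in type, e.g. type: my_project.datasets.TextFileDataset (a hypothetical module path).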
2.3 The Parameters
Parameters in Kedro are stored in simple YAML files (parameters.yaml) that keep your pipeline settings, hyperparameters, and thresholds cleanly separated from your code.
Here’s an example parameters.yaml file:
clip:
    lower: 0
    upper: 100
Let’s revisit the clip_outliers node from the earlier example. Notice that it uses params:clip.lower and params:clip.upper as inputs. Kedro pulls these values from parameters.yaml at runtime, so you don’t need to hard-code them into your function.
node(
    clip_outliers,
    inputs=["data_no_na", "params:clip.lower", "params:clip.upper"],
    outputs="data_clipped",
)
This flexibility allows you to update parameters without touching the pipeline code. Need to experiment with a new threshold? Just update parameters.yaml or override the values from the command line, e.g., kedro run --params "clip.lower=10,clip.upper=90".
Just like the Data Catalog, you can also use Kedro’s configuration environments to define parameters for different scenarios (e.g., local vs. production):
- conf/base/parameters.yaml: Default parameters for prototyping (e.g., fewer training epochs for fast iteration).
- conf/prod/parameters.yaml: Production parameters for accuracy (e.g., more training epochs for robust models).
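As an illustration (the train.epochs key is made up for this example, not part of the pipeline above), the two files could look like this, with the prod file re-declaring only the settings it wants to change:

# conf/base/parameters.yaml
train:
    epochs: 5

# conf/prod/parameters.yaml
train:
    epochs: 100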
When you want to use the production environment, simply run kedro run --env prod, and Kedro will automatically override the base parameters with the production environment-specific ones! This makes it easy to manage settings for different stages of your workflow without duplicating code or configuration. This centralized, flexible approach to parameter management brings several key benefits:
- Clean Separation: Parameters are centralized, making it easy to see all the tunable settings for your pipeline in one place. This centralization also helps you minimize the risk of errors from mismatched settings.
- Version Control: Parameter files can be versioned alongside your code, ensuring reproducibility.
- Experimentation: Easily test different parameter values without changing the underlying pipeline logic.
- Environment Flexibility: Adapt parameters for local, testing, or production environments with minimal effort.
3. Conclusion
Machine learning workflows are exciting but messy. They’re iterative, experimental, and inherently complex—you’re juggling raw data, intermediate outputs, models, hyperparameters, and more. Without a solid structure, it’s easy for things to turn into a chaotic tangle of notebooks and scripts, making collaboration, scaling, and reproducibility much harder than they need to be.
This is exactly where Kedro can make your life easier. By bringing structure and best practices to ML projects, Kedro helps you organize your work into clean, modular pipelines, manage your data with ease, and keep your parameters flexible and centralized. Over time, it has helped me and my teams build ML projects faster, more robustly, and with far fewer headaches.
That said, Kedro is a big framework, and I’ve only scratched the surface here. Features like experiment logging, pipeline visualization, and powerful integrations make Kedro even more useful, but there’s just no way to cover everything in one post. If you’re intrigued by what you’ve seen so far, I highly recommend diving into Kedro’s documentation to explore its full potential.
Now, let’s address a concern I often hear: “It’s another framework to learn, and I just don’t have the time.”
I get it. But here’s the thing: whether you know it or not, you’re already building your own framework every time you organize functions, pass data between them, and try to keep things consistent. The problem is, most of us don’t do this consistently or in a way that’s easy to maintain or scale. Kedro gives you that structure out of the box. Writing nodes? That’s just writing Python functions. Building pipelines? You’re already doing that too, just less systematically.
The beauty of Kedro is its flexibility. It’s pure Python, so you can start small. Use Kedro to organize a single part of your project—maybe that messy preprocessing workflow—and see how it fits. You don’t have to commit to the whole framework upfront. Over time, you’ll see how Kedro’s structure and tools make your work cleaner, faster, and more enjoyable.
For me, Kedro has been a game-changer. It’s helped me stay organized, reduce complexity, and focus on what matters most: building great machine learning systems. If you’re ready to give it a try, start simple. Once you see the difference, I think you’ll wonder how you managed without it.
Happy coding! 🚀
Leandro Salemi, Data Scientist & ML Engineer at Helicon Technologies.