What is Kubeflow Pipelines?

Kubeflow Pipelines is a platform for scheduling multi-step and parallel-step ML workflows. Using the Kubeflow Pipelines UI, you can manage these ML workflows and their experiments, jobs, and runs.

Each pipeline is a description of an ML workflow: the components it contains, when and how they should run, each component’s inputs and outputs, and how the components connect to one another.

Once the pipeline is developed and defined, it can be managed, run and shared via the Kubeflow Pipelines UI.

Why Use Kubeflow Pipelines?

ML pipelines often involve dozens of different tools, libraries, and frameworks. A machine learning pipeline tool like Kubeflow Pipelines takes over the job of building, managing, and monitoring these data processing pipelines.

Kubeflow Pipelines enables the orchestration of ML workflows in a simple and robust way. For example, say you are training a model from multiple datasets, each of which needs to be processed in a different way. You can create a pipeline that pulls the datasets, processes them in two parallel steps, and combines the results into a single dataset. A subsequent step uses that dataset to train the model, and a final step runs a prediction or serving function.
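The shape of that example workflow can be sketched in plain Python. This is only a conceptual model of the fan-out and fan-in dataflow, not Kubeflow Pipelines SDK code: every function name, the hard-coded data, and the trivial "model" (a mean) are assumptions made up for illustration, and a thread pool stands in for the parallel execution that the platform would handle.

```python
from concurrent.futures import ThreadPoolExecutor


def pull_datasets():
    """Step 1: return two raw datasets (hard-coded stand-ins for real sources)."""
    raw_a = [" 3", "1 ", "2"]       # strings that need parsing
    raw_b = [10.0, None, 30.0]      # numbers with a missing value
    return raw_a, raw_b


def process_a(raw):
    """Step 2a: parse whitespace-padded strings into integers."""
    return [int(item.strip()) for item in raw]


def process_b(raw):
    """Step 2b: drop missing values and scale the rest down by 10."""
    return [int(item / 10) for item in raw if item is not None]


def combine(part_a, part_b):
    """Step 3: merge both processed datasets into one sorted dataset."""
    return sorted(part_a + part_b)


def train(dataset):
    """Step 4: 'train' a trivial model, here just the dataset mean."""
    return sum(dataset) / len(dataset)


def predict(model, value):
    """Step 5: use the model, here by reporting the distance from the mean."""
    return value - model


def run_pipeline():
    raw_a, raw_b = pull_datasets()
    # The two processing steps are independent, so they can run in parallel.
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_a = pool.submit(process_a, raw_a)
        future_b = pool.submit(process_b, raw_b)
        part_a, part_b = future_a.result(), future_b.result()
    dataset = combine(part_a, part_b)   # fan-in: wait for both branches
    model = train(dataset)
    return dataset, model, predict(model, 4)


if __name__ == "__main__":
    print(run_pipeline())
```

In a real pipeline, each function would be a separate component running in its own container, and the dependencies between steps would be inferred from how outputs are passed as inputs.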

The pipelines can be run as experiments, so it is easy to try different ideas before finalizing the pipeline.

And once the pipeline is ready, both it and its components can be reused in new pipelines without having to be rebuilt each time.

How Can Kubeflow Pipelines Be Used?

Pipelines can be created via the Kubeflow Pipelines Python SDK. Each component is defined with metadata that describes the component, its inputs and outputs, and the container image and command to run.

Instead of a container image and run command, a component’s implementation can also be a Python function, in which case Kubeflow Pipelines wraps the function and prepares it as a component for you.
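To make the idea of wrapping a function concrete, here is a minimal sketch in plain Python of how component metadata could be derived from a function's signature. It does not use or reproduce the Kubeflow Pipelines SDK (in SDK v2 this role is played by the `@dsl.component` decorator). `ComponentSpec`, `component_from_function`, the `python:3.11` default image, and the placeholder command are all hypothetical and chosen only for illustration.

```python
import inspect
from dataclasses import dataclass, field


@dataclass
class ComponentSpec:
    """Hypothetical stand-in for the metadata that describes a component."""
    name: str
    inputs: dict    # parameter name -> type name
    output: str     # return type name
    image: str      # container image the step would run in
    command: list = field(default_factory=list)


def component_from_function(func, image="python:3.11"):
    """Derive a ComponentSpec from a fully type-annotated Python function."""
    signature = inspect.signature(func)
    inputs = {
        name: param.annotation.__name__
        for name, param in signature.parameters.items()
    }
    return ComponentSpec(
        name=func.__name__,
        inputs=inputs,
        output=signature.return_annotation.__name__,
        image=image,
        # Placeholder only: a real wrapper would ship the function's code
        # into the container and invoke it with the declared inputs.
        command=["python", "-c", f"# call {func.__name__} here"],
    )


def normalize(value: float, minimum: float, maximum: float) -> float:
    """Scale value into the range [0, 1]."""
    return (value - minimum) / (maximum - minimum)


spec = component_from_function(normalize)

if __name__ == "__main__":
    print(spec)
```

The point of the sketch is that the function's name and type annotations already carry most of the metadata a component needs, which is why a plain Python function can serve as a component's implementation.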

Components can also be shared, so a pipeline’s definition can use a pre-built component instead of defining a new one.

MLRun and Iguazio Integration

Kubeflow is one component of a larger ecosystem. To be used for machine learning operations, Kubeflow needs to be extended. It focuses on the model development pipeline for running and tracking ML experiments. However, most users need additional services, as outlined in the picture above: scalable data and feature engineering, model management, production and real-time pipelines, versioned and managed data repositories, managed runtimes, authentication and security, and so on.

It is possible to create pipelines via the MLRun SDK, which abstracts some of the creation process and allows you to deploy ML workflows even faster. MLRun and Iguazio contain additional data services and management layers that complement Kubeflow and extend it into a scalable, operational, managed data science platform.

More Kubeflow Resources:

Blog: Kubeflow: Simplified, Extended, and Operationalized

Blog: How GPUaaS On Kubeflow Can Boost Your Productivity

Blog: Orchestrating ML Pipelines at Scale with Kubeflow
