The open source ML tooling ecosystem has become vast in the last few years, with many tools covering different aspects of the complex and expansive process of building, deploying and managing AI in production. Some tools overlap in their capabilities while others complement each other nicely. In part because AI/ML is still an emerging and ever-evolving practice, the messaging around what all these tools can accomplish can be quite vague. In this article, we’ll dive into three tools to better understand their capabilities, the differences between them, and how they fit into the ML lifecycle.
Kubeflow, MLflow, and MLRun are popular open source tools in the ML ecosystem. They share some attributes but address different facets of the ML lifecycle. How are they alike?
- Each of them enables cross-functional collaboration, with some level of parameter, artifact, and model tracking.
- Each is supported and maintained by major players in the AI industry.
- Each of them addresses data science and experimentation requirements.
While there are a handful of feature-level similarities among these three tools, it’s important to understand that they each solve different challenges, and none of them is a true replacement for another.
What is Kubeflow?
Kubeflow started as an internal tool at Google and is now a multi-architecture, multi-cloud framework, often described as the ML toolkit for Kubernetes. It provides a system for managing ML components on top of Kubernetes and acts as the bricks and mortar for model development, with a focus on the automation, scaling and tracking of ML pipelines.
Notable Kubeflow components include:
- Containerized Jupyter notebooks: Kubeflow users can create a Jupyter server with resource requirements and get provisioned containers.
- Kubeflow Pipelines: A platform for building and scheduling multi- and parallel-step ML workflows for training, data processing and transformation, validation and more. Kubeflow Pipelines can be managed, run, shared, and reused via the UI.
- Scalable training operators: Application scaling can be managed through various Kubeflow-hosted operators for frameworks like TensorFlow, PyTorch, Horovod, MXNet, and Chainer.
- Model serving solutions: The Kubeflow project hosts KFServing (since renamed KServe), a basic model serving framework, and supports external serving frameworks like Seldon, Triton and MLRun (which we’ll discuss below).
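To make the idea of multi-step and parallel-step workflows concrete, here is a minimal sketch in plain Python. It deliberately does not use the Kubeflow Pipelines SDK: the step names and the `run_pipeline` helper are hypothetical, and a real pipeline would run each step in its own container on Kubernetes. The sketch only shows how a dependency graph determines which steps can run side by side.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
PIPELINE = {
    "ingest": set(),
    "clean": {"ingest"},
    "train_a": {"clean"},
    "train_b": {"clean"},
    "validate": {"train_a", "train_b"},
}


def run_pipeline(graph):
    """Group steps into stages; steps within one stage do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the graph is not a DAG
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose dependencies are all done
        stages.append(ready)
        # An orchestrator would launch these steps in parallel at this point.
        sorter.done(*ready)
    return stages


stages = run_pipeline(PIPELINE)
for number, steps in enumerate(stages, start=1):
    print(f"stage {number}: {', '.join(steps)}")
```

Here the two training steps land in the same stage because both depend only on `clean`, which is the kind of parallelism a pipeline orchestrator can exploit.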
Kubeflow contains some powerful components, making it a favored tool in the ML community: it’s Kubernetes native, highly customizable, open source and loosely coupled. But it has some key limitations that need to be addressed for enterprise use. It’s most accurate to characterize Kubeflow as an ecosystem of tools rather than a holistic or integrated solution. Without a managed solution, data science teams will need to invest significant work hours in integrating the various Kubeflow and add-on components into a complete service.
The scope of Kubeflow focuses mainly on the model development pipeline for running and tracking ML experiments. For the production side, data science teams will need additional services for tasks like scalable data and feature engineering, model management, production and real-time pipelines, versioned and managed data repositories, managed runtimes, authentication/security, and so on.
What is MLflow?
MLflow was developed by Databricks as an open-source component for experiment tracking. While it now has additional functionality, this area is still where it really shines. It is a single Python package that covers some key steps in model management.
The key components of MLflow are:
- Tracking: Includes an API and UI for logging and querying experiment parameters, code versions, metrics, and output files. Experiments can be logged and queried using the Python, REST, R, and Java APIs.
- Projects: A format for organizing and describing data science code in a reusable and reproducible way. The Projects component includes an API and command-line tools for running projects, making it possible to chain together projects into workflows.
- Models: A standard format for packaging models, so they can be used by different downstream tools.
- Model Registry: A centralized model store, set of APIs, and UI, so that MLflow Models can be managed collaboratively. The MLflow model registry provides model lineage, along with experiments and runs, versioning and other metadata.
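As a rough illustration of what experiment tracking involves, the sketch below records parameters and metrics for a few runs and then queries for the best one. It is a conceptual stand-in written in plain Python, not MLflow's implementation: the `Run` and `Tracker` classes and the example values are hypothetical, and everything is kept in memory rather than sent to a tracking server.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class Run:
    """One experiment run: its inputs (params) and its results (metrics)."""
    params: dict
    metrics: dict = field(default_factory=dict)
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started_at: float = field(default_factory=time.time)


class Tracker:
    """Hypothetical in-memory stand-in for a tracking server."""

    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics):
        """Store a copy of the params and metrics as a new run."""
        run = Run(params=dict(params), metrics=dict(metrics))
        self.runs.append(run)
        return run

    def best_run(self, metric, lower_is_better=True):
        """Query the logged runs for the best value of one metric."""
        candidates = [r for r in self.runs if metric in r.metrics]
        pick = min if lower_is_better else max
        return pick(candidates, key=lambda r: r.metrics[metric])


tracker = Tracker()
tracker.log_run({"learning_rate": 0.1}, {"rmse": 0.42})
tracker.log_run({"learning_rate": 0.01}, {"rmse": 0.35})
tracker.log_run({"learning_rate": 0.001}, {"rmse": 0.39})

best = tracker.best_run("rmse")
print(best.params, best.metrics)
```

In MLflow itself, the logging side of this is handled by calls such as `mlflow.log_param` and `mlflow.log_metric` inside a `mlflow.start_run()` block, with the results browsable in the tracking UI.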
MLflow is a great tool for experiment tracking, but there are many parts of the MLOps lifecycle that MLflow doesn’t cover, like applying automation, automating serverless functions, running jobs, model monitoring, preparing data logic, and so on. Databricks does have a solution for automated deployment and job execution, Databricks MLOps Stack, which is currently in private preview. This feature provides a customizable stack for production ML projects on Databricks, which might be worth checking out depending on your needs.
Kubeflow vs. MLflow: Key Differences
Kubeflow and MLflow are both open source ML tools that were started by major players in the ML industry, and they do have some overlaps. Both offer features that aid collaboration across multiple roles, both are scalable and portable, and both can be plugged into larger ML systems. While Kubeflow Pipelines is a widely used tool for scheduling multi-step and parallel-step ML workflows, MLflow has an answer in the form of the data scientist-friendly MLflow Recipes, which are structured as git repositories with YAML-based config files and Python code. Even so, the two tools take different approaches to developing and deploying ML applications. Though their respective capabilities have grown over the years, Kubeflow is fundamentally a container orchestration system, and MLflow is an experiment tracking tool.
Other key differences include:
- Kubeflow is an overly complex tool for data scientists because it speaks the language of engineers. It starts with Docker containers, YAML files and scripts, while data scientists are concerned more with Python and Jupyter. It’s much more efficient for data scientists to use familiar tools and UI portals, and have resources provisioned under the hood, which isn’t possible with Kubeflow alone.
- MLflow is much more accessible for data scientists, because the MLflow service merely listens in on parameters and metrics, while the actual runs happen on the data scientist’s local environment.
What is MLRun?
MLRun is an open MLOps orchestration framework for quickly building and managing continuous ML applications across their lifecycle. MLRun integrates into your development and CI/CD environment and automates the delivery of production data, ML pipelines, and online applications. MLRun can be used from any IDE, on your local machine or in the cloud.
To understand how MLRun can be used, it’s helpful to look at how it applies to different tasks in the MLOps lifecycle:
Ingest and Process Data
To streamline the sometimes-complex task of ingesting and processing data, MLRun offers a simple UI to various offline and online data sources, support for batch or real-time data processing at scale, data lineage and versioning, structured and unstructured data, and more. MLRun also offers the only real-time feature store in the open source ecosystem. The feature store automates the collection, transformation, storage, catalog, serving, and monitoring of data features across the ML lifecycle and enables feature reuse and sharing.
Develop and Train Models
Easily build ML pipelines using data from various sources or the built-in feature store, train models at scale with multiple parameters, test models, track each experiment, and register, version and deploy models. MLRun provides scalable built-in or custom model training services that integrate with any framework and can work with third-party training/AutoML services. You can also bring your own pre-trained model from other platforms (see our demos to do this with SageMaker or Azure) and use it in the pipeline.
Monitor, Manage and Alert
Observability is built into MLRun objects (data, functions, jobs, models, pipelines, etc.), eliminating the need for complex integrations and code instrumentation. View the application/model resource usage and model behavior (drift, performance, etc.), define custom app metrics, and trigger alerts or retraining jobs.
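To give a feel for what a drift check that triggers retraining can look like, here is a small sketch in plain Python. It is not MLRun's monitoring logic: the `drift_score` and `check_drift` functions, the threshold of 2.0, and the sample values are all assumptions chosen for illustration, and production systems typically use richer statistics than a shift in the mean.

```python
from statistics import mean, stdev


def drift_score(baseline, live):
    """Shift of the live mean from the baseline mean, in baseline std units."""
    spread = stdev(baseline)
    if spread == 0:
        raise ValueError("baseline has no variance; score is undefined")
    return abs(mean(live) - mean(baseline)) / spread


def check_drift(baseline, live, threshold=2.0):
    """Return 'retrain' when the drift score exceeds the threshold, else 'ok'."""
    return "retrain" if drift_score(baseline, live) > threshold else "ok"


# Hypothetical feature values seen at training time and in production.
training_values = [10.0, 11.0, 9.0, 10.5, 9.5]
stable_window = [10.2, 9.8, 10.1]
shifted_window = [14.0, 15.0, 14.5]

print(check_drift(training_values, stable_window))
print(check_drift(training_values, shifted_window))
```

The returned action name is where a monitoring system would hook in an alert or a retraining job.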
Project Management and CI/CD Automation
Assets, metadata, and services (data, functions, jobs, artifacts, models, secrets, etc.) are organized into projects. Projects can be imported and exported as a whole and mapped to git repositories or IDE projects (in PyCharm, VSCode, etc.), which enables versioning, collaboration, and CI/CD. Project access can be granted or restricted with role-based management.
MLflow vs. MLRun: Key Differences
While there is some overlap between MLflow and MLRun, they have totally different goals. MLRun isn’t an alternative to MLflow, and vice versa. MLRun is an end-to-end orchestration layer for ML and MLOps. It’s not primarily a tracking system, though it does offer that functionality. MLflow offers a way to track your experiments, which is one component of the experimentation phase. There are also some ways to define metadata in MLflow. Unlike MLRun, however, MLflow leaves many parts of the MLOps lifecycle uncovered, like applying automation, automating serverless functions, running jobs, model monitoring, preparing data logic, and so on.
MLRun is for what we call AutoMLOps, where the entire operationalization process is automated. MLRun uses serverless function technology: write the code once, using your preferred development environment and simple “local” semantics, and then run it as-is on different platforms and at scale. MLRun automates the build process, execution, data movement, scaling, versioning, parameterization, output tracking, CI/CD integration, deployment to production, monitoring, and more. MLRun provides an open pluggable architecture, so you have the option to use MLflow (or any other tool) for the development side, and then use MLRun to automate the production distributed training environment without adding glue logic.
So which tool is right for you? It depends on the task at hand. Luckily, they are all open source, so try them out and see which one suits your current needs best.