Still waiting for ML training to be over? Tired of running experiments manually? Not sure how to reproduce results? Wasting too much of your time on devops and data wrangling?
It’s okay if you’re a hobbyist, but data science models are meant to be incorporated into real business applications. Businesses won’t invest in data science if they don’t see a positive ROI. This calls for the adoption of an “engineered” approach — otherwise it is no more than a glorified science project with data.
Engineers use microservices and automated CI/CD (continuous integration and deployment) in modern agile development. You write code, push and it gets tested automatically on some cluster at scale. It then goes into some type of beta/Canary testing phase if it passes the test, and onto production from there. Kubernetes, a cloud-native cluster orchestrator, is the tool now widely used by developers and devops to build an agile application delivery environment.
Leading ML engineers and AI/ML-driven companies are already using Kubernetes, which comes with pre-integrated tools and scalable frameworks for data processing, machine learning and model deployment. Some Kubernetes frameworks also enable horizontal scaling and efficient use of GPUs, which further cut down overall wait times and costs.
Don’t settle for proprietary SaaS/cloud solutions or legacy architectures like Hadoop. Get a Kubernetes-based solution for a scalable and forward-looking data science platform. The following post reviews best practices for the process and Kubernetes frameworks used to scale and operationalize data science endeavors and provides detailed information about the Kubeflow ML toolkit for Kubernetes (K8s).
The Data Science Pipeline
Data scientists mostly go through the following steps during a development phase:
- Data collection from CSV or dense formats (e.g. Parquet) which entail manual extraction from external sources.
- Manual data labeling, exploration and enrichment to identify potential patterns and features.
- Model training and validation.
- Model testing with real/fresh data, or as part of a bigger application.
- Rinse and repeat when unhappy with the outcome.
Once done with development, against all odds, models are deployed into a production environment where fancy dashboards are on display to identify anomalies or predict behavior. Those of us who are more advanced in our AI journey also create AI-driven user interactions/apps or trigger alerts and actions.
But then data changes… model accuracy drifts… the users ask for update in apps and dashboard, etc… Now we find ourselves running much of the pipeline again to produce new models. This is where we start Googling a continuous development and integration approach, but more on that later.
The tools required to build our pipeline can be divided into the following six categories:
A. Automated data collectors from offline or online sources.
B. Interactive tools for data labeling, exploration and model development.
C. Analytics tools/databases to aggregate, join and pivot data at scale.
D. Parallel model training frameworks.
E. Model and API serving frameworks.
F. Management, monitoring and dashboards.
The only sane way to integrate all these tools under one platform is with Kubernetes!
Kubernetes, The ML Platform of Choice
Kubernetes hosts several packaged and pre-integrated data and data science frameworks on the same cluster. These are usually scalable or they auto-scale, and they’re defined/managed with a declarative approach: specify what your requirements are and the service will continuously seek to satisfy them, which provides resiliency and minimizes manual intervention.
Kubeflow is an open source project that groups leading relevant K8 frameworks. Kubeflow components include Jupyter notebooks, Kubeflow Pipeline (workflow and experiment management), scalable training services (for Tensorflow, PyTourch, Horovod, MXNet, Chainer), model serving solutions, and Iguazio’s serverless framework (Nuclio. Kubeflow also offers examples and pre-integrated/tested components.
In addition to typical data science tools, Kubernetes can host data analytics tools such as Spark or Presto, various databases and monitoring/logging solutions such as Prometheus, Grafana, Elastic Search, etc. It also enables the use of serverless functions (i.e. auto built/deployed/scaled code like AWS Lambda) for a variety of data-related tasks or APIs/model serving, read more on that in my post (Serverless: Can It Simplify Data Science Projects?).
The key advantage of Kubernetes vs proprietary cloud or SaaS solutions is that its tools are regularly added and upgraded, Google’s search and Stack overflow are often the fastest path to help and the solution can be deployed everywhere (any cloud or even on-prem or on a laptop). A community project also forces associated components/services to conform to a set of standards/abstractions which simplify interoperability, security, monitoring, etc.
You can build and manage a Kubernetes data science platform yourself (follow the doc), or use managed and pre-integrated solutions from cloud providers or companies like Iguazio.
Building an Efficient Data Science Pipeline with Kubeflow
Unfortunately, just betting on a credible platform is not enough. Believe me, I speak from experience: Life gets easier once you adopt three guiding rules:
- Make it functional — create reusable abstract functions/steps which can accept parameters.
- Make it scalable — apply parallelism to every step (or as often as possible, within reason).
- Automate — avoid manual and repetitive tasks by using declarative semantics and workflows.
The current trend in data science is to build “ML factories,” i.e. much like in agile software development, build automated pipelines which take data, pre-process it, run training, generate, deploy and monitor the models. The declarative and automated deployment and scaling approach offered by Kubernetes is a great baseline, but it’s missing a way to manage such pipelines on top of that.
A relatively new tool which is part of the Kubeflow project is Pipelines, a set of services and UI aimed at creating and managing ML pipelines. We can write our own code or build from a large set of pre-defined components and algorithms contributed by companies like Google, Amazon, Microsoft, IBM, NVIDIA, Iguazio, etc.