Data scientists today have to choose from a massive toolbox in which every item has its pros and cons. We love the simplicity of Python tools like pandas and Scikit-learn, the operational readiness of Kubernetes, and the scalability of Spark and Hadoop, so we just use all of them.
What happens? Data scientists explore data using pandas, then data engineers recode the same logic in Spark so that it can scale or work with live streams and operational databases. We walk the same path again and again whenever we need to swap datasets or change the logic. We have to manage the entire data and CI/CD pipelines manually, as well as take care of the business logic and the clusters we build on Kubernetes, Hadoop, or possibly both.
It takes a DevOps army to manage all these siloed solutions. We end up like hamsters on treadmills, working hard without getting anywhere.
In addition, when you process your features using different tools or languages, you end up with different implementations that produce different values, leading to skewed models and lower accuracy.
Well, here’s the good news: there’s an alternative, and it comes in the form of MLOps for Python code. When you operationalize machine learning, you can deploy your Python code into production easily without rewriting it, keeping the same accuracy while gaining fast performance, scalability, and operational readiness (logging, monitoring, security, etc.). This saves you significant time and resources. You can write your code in Python and run it fast without Spark and without any DevOps overhead.
MLOps for Python can revolutionize your ML pipelines
As the lyric goes, “You may say I’m a dreamer, but I’m not the only one!” I’m going to prove that it’s possible by applying MLOps for Python code with Nuclio, an open-source serverless pipeline automation platform, together with RAPIDS, a free, open-source data science acceleration platform incubated by NVIDIA.
We worked with NVIDIA to integrate Iguazio’s PaaS and the Nuclio platform together with RAPIDS. With Python-based stream processing, you can access data speeds that are light-years ahead of any alternatives, as well as unprecedented scalability, without changing your Python pandas code. Plus it’s all serverless, so your operational overhead is low.
You’re about to see me crunch live data from JSON-based logs, carry out real-time data analysis with Python, and write the aggregated results in a compressed Parquet format that you can use for ML training or further queries. I’ll discuss both batch and real-time data streaming with Python code.
Note that the same code will run at rather high speeds even without GPUs, as it leverages Nuclio’s engine parallelism and resource optimization logic. The addition of GPUs takes performance to the extreme.
First, though, here’s an overview.
Why is Python so slow and difficult to scale?
We love Python tools, but Python and pandas deliver reasonable performance only on a small dataset: the whole dataset has to fit in memory, and it has to be processed by the optimized C code beneath the pandas and NumPy layer. If you need to process large datasets, you’ll have to deal with data transformations, data copies, intense IO operations, and more, all of which are time-consuming.
By nature, Python is synchronous thanks to the notorious GIL (global interpreter lock), which lets only one thread execute Python bytecode at a time and makes Python highly inefficient at complex, parallel tasks. Asynchronous Python does perform better, but it doesn’t resolve the embedded locking issues and it complicates development.
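To see the effect of the GIL for yourself, here is a small experiment (a sketch, not a benchmark). It runs the same CPU-bound loop twice in sequence and then in two threads. The function name `count_down` and the loop size are arbitrary choices for illustration. On a standard CPython build with the GIL enabled, I would expect the two timings to be roughly equal, because the threads take turns instead of running in parallel; the exact numbers depend on your machine and interpreter build.

```python
import threading
import time


def count_down(n):
    """Pure-Python, CPU-bound loop: it holds the GIL while it runs."""
    while n > 0:
        n -= 1
    return n


N = 2_000_000  # arbitrary workload size, small enough to finish quickly

# Run the loop twice, one call after the other.
start = time.perf_counter()
count_down(N)
count_down(N)
sequential = time.perf_counter() - start

# Run the same two loops in two threads.
threads = [threading.Thread(target=count_down, args=(N,)) for _ in range(2)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

print(f"sequential: {sequential:.3f}s, two threads: {threaded:.3f}s")
```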
That’s the advantage of frameworks like Spark over Python and pandas. They use asynchronous engines such as Akka with memory-optimized data layouts that enable them to distribute work between various machines and workers. It’s understandable that they became the default because they deliver improved performance and stability.
RAPIDS puts Python on steroids
At NVIDIA, our friends had a great idea: keep the Python-facing APIs of pandas, XGBoost, Scikit-learn and other popular frameworks, while using high-performance C code to process data in the GPU. To further speed up data transfer and data manipulation, they adopted the memory-friendly Apache Arrow data format.
The result is an efficient Python data science platform that builds and deploys Python code faster and more accurately. RAPIDS supports data IO (cuIO), data analytics (cuDF) and machine learning (cuML), and all of these components share the same memory structures. Together they form a pipeline of data ingestion, analytics, and ML that doesn’t need to copy data back and forth between the GPU and the CPU.
Here’s an example of a use case that reads a large JSON file (1.2 GB) and aggregates the data with the pandas API. It shows that you can run the same Python code 30 times faster with RAPIDS (see the full notebook here). Without IO, that jumps to 100 times faster, which opens up the possibility of carrying out far more complex computation on the same data.
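The full notebook has the real code. As a minimal sketch of the pattern, the snippet below reads JSON-lines records and aggregates them with the pandas API. The sample records, the field names (`user`, `latency_ms`, `status`) and the helper `aggregate_logs` are invented for illustration. It runs on the CPU with pandas; the assumption is that with RAPIDS installed you would change the import to `import cudf as pd`, since cuDF aims to mirror the pandas API, though not every pandas call is guaranteed to have an identical cuDF counterpart.

```python
import io

import pandas as pd  # assumption: with RAPIDS, swap for `import cudf as pd`

# Hypothetical JSON-lines log records standing in for the 1.2 GB file.
SAMPLE_LOGS = """\
{"user": "alice", "latency_ms": 120, "status": 200}
{"user": "bob", "latency_ms": 340, "status": 500}
{"user": "alice", "latency_ms": 80, "status": 200}
"""


def aggregate_logs(source):
    """Read JSON-lines records and compute per-user latency statistics."""
    df = pd.read_json(source, lines=True)
    return df.groupby("user")["latency_ms"].agg(["mean", "max", "count"])


summary = aggregate_logs(io.StringIO(SAMPLE_LOGS))
print(summary)
```

For a real file you would pass its path instead of the in-memory buffer.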
Let's highlight how groundbreaking this is. We used just one GPU (NVIDIA T4), increasing the cost of the server by around a third, and in return we enjoyed performance that’s 30x faster. To put it another way, we needed only a few lines of Python code to process 1 GB of complex data per second. It’s pretty incredible!
That’s not all!
At Iguazio, we also came up with an open-source pipeline orchestration platform that we call MLRun. With MLRun, you can automatically package your Jupyter code as a job microservice and conduct numerous experiments at the same time, each with its own combination of algorithmic functions and parameters, trusting MLRun to select the best results each time. It runs locally on your computer, or integrates natively with Kubeflow Pipelines for end-to-end learning workflows. MLRun tracks all the variables in every experiment, both inputs and outputs, to help you keep on top of documentation.
Imagine what we could do if we packed the code inside a serverless function. It could run on user request or at intervals, reading from or writing to dynamically attached data volumes.
Can we achieve real-time data streaming with Python?
Now that we’ve seen what MLOps for Python is capable of, what about pushing it a step further and attempting real-time data streaming or real-time data analysis with Python?
Well, we’ve done that too! This code, taken from the Kafka best-practice guides, reads from a stream and does minimal processing.
As we said already, because Python is synchronous it’s not very efficient at real-time, complex data manipulation. We can only reach a throughput of a few thousand messages per second with this program. If you add JSON and pandas processing like in our previous example (see notebook), performance degrades even more, down to just 18 MB/s.
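A minimal sketch of such a consumer looks like this. It is a stand-in rather than the listing from the Kafka guides, and it assumes the kafka-python client (`KafkaConsumer`). The helper names, the topic name and the broker address are hypothetical. The Kafka import sits inside `kafka_messages` so that the processing loop can be tried locally against an in-memory list without a broker.

```python
import json
import time


def kafka_messages(topic, brokers):
    """Yield raw payloads from Kafka (assumes the kafka-python package)."""
    from kafka import KafkaConsumer  # lazy import: only needed with a real broker

    consumer = KafkaConsumer(topic, bootstrap_servers=brokers)
    for message in consumer:
        yield message.value  # payload as bytes


def process_stream(payloads):
    """Handle one message at a time, synchronously.

    Returns (message count, total bytes, elapsed seconds).
    """
    count = 0
    total_bytes = 0
    start = time.perf_counter()
    for payload in payloads:
        json.loads(payload)  # minimal processing: decode the record
        count += 1
        total_bytes += len(payload)
    return count, total_bytes, time.perf_counter() - start


# Local stand-in for a topic, so the loop runs without Kafka.
fake_topic = [json.dumps({"id": i, "value": i * 2}).encode() for i in range(1000)]
count, total_bytes, seconds = process_stream(fake_topic)
print(f"{count} messages, {total_bytes} bytes in {seconds:.4f}s")

# With a real broker (hypothetical topic and address):
#   process_stream(kafka_messages("logs", ["localhost:9092"]))
```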
Does that mean we have no choice but to move stream processing back to Spark?
No, no and no.
We can use Nuclio to speed up data ingestion with Python, and everything else too. Nuclio is the fastest serverless framework, and it’s part of Kubeflow (Kubernetes ML framework). Nuclio runs multiple programming languages wrapped by a real-time and highly concurrent execution engine, allowing it to run numerous instances of code, plus efficient micro-threads, in parallel without extra coding. Nuclio handles auto-scaling within the process and across multiple processes/containers (see this tech coverage blog).
Nuclio supports 14 different triggering or streaming protocols (including HTTP, Kafka, Kinesis, Cron, batch), and it automatically handles all aspects of stream partitioning, checkpoints and high availability. The triggers are specified through configuration (without changing the code), enabling fast access to external data volumes. It uses a simple function handler to invoke functions and can manage stream processing and data access in highly optimized binary code. A single Nuclio function can process hundreds of thousands of messages per second, and achieve throughput of more than a GB/sec.
What’s far more important is that only Nuclio’s serverless framework has optimized NVIDIA GPU support. Nuclio can maximize GPU utilization, and scale out to more processes and GPUs if needed.
I may be biased, but that doesn’t change the fact that all of this is 100% true.
MLOps for Python delivers Python-based stream processing that’s 30x faster without DevOps
When you bring together RAPIDS and Nuclio, you can reach the true paradise of GPU-accelerated, Python-based stream processing. This next code is largely similar to the previous batch processing case, but we’ve made some changes. We put it into a function handler and grouped incoming messages together to make fewer GPU calls (see the full notebook).
We can use a Kafka or an HTTP trigger to test the same function, and Nuclio will still cope with the parallelism. Nuclio divides the stream partitions among several workers without requiring us to carry out any extra development work.
Here’s our setup: a single Nuclio function process (on a dual socket Intel server with one NVIDIA T4) and a 3-node Kafka cluster. With that Python data science platform, we were able to process 638 MB/s! We’re talking speeds that are 30x higher than when you write your own Python Kafka client, and it takes only a tiny amount of Python code. Plus, our setup autoscales to respond to any amount of traffic.
In our tests, we saw that the GPU was still underutilized. Using MLOps for Python streamlines data ingestion with Python, and still leaves spare capacity to carry out complex data manipulation and computation (joins, ML predictions, transformations, etc.) without compromising on performance.
The true gift of MLOps for Python
The fact that our Python-based stream processing delivered better and faster performance for decreased development time and costs is only the tip of the iceberg. The real advantage of using serverless solutions is that they’re “serverless” (read more in my post). You could use the same code, develop it in a notebook (see this example) or your chosen IDE, and it would take just a single command for it to be built, containerized, and shipped to a Kubernetes cluster with full instrumentation (logs, monitoring, auto-scaling) and security hardening.
Integrating Nuclio with Kubeflow Pipelines and MLRun means you’re actualizing true MLOps for Python. Use it to create multi-stage data or ML pipelines in order to automate your data science workflow and collect execution and artifact metadata with minimal effort, so you can quickly and simply reproduce experiment results.
Download Nuclio here and deploy it on your Kubernetes (see RAPIDS examples). Check out this related article Life After Hadoop: Getting Data Science to Work for Your Business.