How to Deploy Your Hugging Face Model to Production at Scale - MLOps Live #20, Oct, 25 at 12pm ET
Developing a model is one thing, but serving a model in production is a completely different task. When a data scientist has a model ready, the next step is to deploy it in a way that it can server the application.
In general, there are two types of model serving: Batch and online. Batch means that you feed the model , typically as a scheduled job, with a large amount of data and write the output to a table. Online deployment means that you deploy the model with an endpoint so applications can send a request to the model and get a fast response at low latency.
The basic meaning of model serving is to host machine-learning models (on the cloud or on premises) and to make their functions available via API so that applications can incorporate AI into their systems. Model serving is crucial, as a business cannot offer AI products to a large user base without making its product accessible. Deploying a machine-learning model in production also involves resource management and model monitoring including operations stats as well as model drifts.
Any machine-learning application ends with a deployed model. Some require simple deployments, yet others involve more complex pipelines. Companies like Amazon, Microsoft, Google, and IBM provide tools to make it easier to deploy machine-learning models as web services. Moreover, advanced tools can automate tedious workflows to build your machine-learning model services.
In this post, you’ll learn what production-grade model serving actually is. We’ll also discuss model serving use cases, tools, and model serving with Iguazio.
A monolithic system may embed a machine-learning model and not expose the model available outside the system. This type of architecture requires every application using the same machine-learning model to own a copy. If there are many such applications, it quickly becomes a nightmare for MLOps. A better approach is to make the machine-learning model accessible to multiple applications via API. This deployment type has various names, including model serving, ML model serving, or machine learning serving, but they all mean the same thing.
Model serving, at a minimum, makes machine-learning models available via API. A production-grade API has the following extra functions:
There are many use cases for model serving. One example is data science for healthcare, which involves machine-learning deployment at scale in a complex clinical environment. These systems may monitor patients’ vital signs in real time and access medical histories and doctors’ diagnoses to make predictions that assist in making decisions and taking actions. As many applications will be working towards common goals, it makes sense to deploy real-time machine-learning models using model serving.
Another significant use case is data science for financial services. This spans various tasks, including risk monitoring, like real-time fraud detection, and personalization services based on customer preferences. AI-based trading algorithms are often time critical and must operate in a secure environment. Before deploying them to the real world, data scientists and system engineers should use a machine-learning test server equipped with the same API as the production systems, since a small latency or mistake can cause a massive impact on the bottom line.
Managing “serving as a model” for non-trivial AI products is not a simple task and can have a significant monetary impact on business operations. There are various ML serving tools for deploying machine-learning models in secure environments at scale, including:
Iguazio’s MLOps platform can ease the management of model serving pipelines. The platform ticks all the boxes when it comes to the major features required for model serving, complete with open-source tools like Iguazio’s MLRun and Nuclio.
MLRun is an MLOps orchestration tool for deploying machine-learning models in multiple containers. It can automate model training and validation and supports a real-time deployment pipeline. Its feature store can handle pre- and post-processing of requests. You can cut time to production by automating MLOps.
MLRun leverages Nuclio ,which is a serverless function platform, for deploying machine-learning models, making it easy to handle on-demand resource utilization and auto-scaling. It also integrates well with streaming engine (e.g. kafka, kinesis) so the function that runs the model runs it on live events. Moreover, it supports multiple worker processes ingesting data concurrently in real time.
MLrun provides a simple way for converting a machine-learning model to a deployable function by using an easy to use python SDK that can be run from your Jupyter notebook.
Model serving is a generic solution that works for any vertical that requires online serving, and Iguazio’s MLOps platform can help simplify building them all.