
Deploying Your Hugging Face Models to Production at Scale with MLRun

Alexandra Quinn | November 17, 2022

Hugging Face is a popular model repository that provides simplified tools for building, training and deploying ML models. Its growing adoption among data professionals, alongside the increasing global need to become more efficient and sustainable when developing and deploying ML models, makes Hugging Face an important technology and platform to learn and master.

Together with MLRun, an open source platform for simplifying the deployment and management of MLOps pipelines, Hugging Face enables data scientists and engineers to get their models to production faster and in a more efficient manner.

In this blog post, we introduce Hugging Face and MLRun and show the value of running them together. This blog is based on the webinar “How to Easily Deploy Your Hugging Face Models to Production”. The webinar also shows a live demo of a Hugging Face deployment with MLRun, including data preparation, a real application pipeline, post-processing and model retraining.

You can watch the webinar, presented by Julien Simon, Chief Evangelist at Hugging Face, Noah Gift, MLOps Expert and author, and Yaron Haviv, co-founder and CTO of Iguazio, here.

How Transformers Have Revolutionized Deep Learning

One of the most recent trends in ML is the reinvention of deep learning. Or rather, it is more accurate to say that transformer models are “swallowing up” traditional deep learning. Traditional deep learning architectures like CNNs, LSTMs and RNNs, which attempt to solve problems with unstructured data like audio and images, were very popular until recently. 

But while these models have proven effective, the release of transformer models, beginning with Google's BERT in 2018, introduced new and far more efficient architectures. These architectures can extract insights from unstructured data in a manner that surpasses previous state-of-the-art benchmarks. Such models include BERT, BART, GPT-2, GPT-3, CLIP, Vision Transformer, Whisper (by OpenAI), Stable Diffusion (text to image) and Wav2Vec2 (speech to text, by Meta).

The industry was quick to adopt them. The 2020 State of AI report called out transformers as a general-purpose architecture for machine learning. The Kaggle data science survey saw RNN and CNN usage and popularity decrease while transformer usage and popularity increased. Finally, the 2022 State of AI report shows the increase of transformer usage in research papers: a few years ago, 81% of usage was NLP-related, but transformers have since become popular in multimodal, image, video and many other use cases.

Hugging Face, Transformers and ML Models

Hugging Face, dubbed “The GitHub of Machine Learning”, is a collaboration platform and repository for ML models. Hugging Face enables developers of any level of proficiency to use simple tools to leverage state-of-the-art ML models. In just a few lines of Python code, developers can run complex models to solve their business problems. The complexity is taken care of under the hood by Hugging Face.

Hugging Face is most famously known for its Transformers library, which is one of the fastest-growing open source projects. The library lets users download models, fine-tune them, run inference with them, and more.
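
For example, a couple of lines with the Transformers library are enough to download a pre-trained sentiment-analysis model from the Hugging Face Hub and run inference with it. A minimal sketch (the default checkpoint the pipeline selects may change between library versions):

    # pip install transformers torch
    from transformers import pipeline

    # Downloads a pre-trained sentiment-analysis model from the Hugging Face Hub
    classifier = pipeline("sentiment-analysis")

    print(classifier("Deploying models to production just got easier."))
    # e.g. [{'label': 'POSITIVE', 'score': 0.9998}]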

By using off-the-shelf models provided through Hugging Face, data scientists and engineers can minimize the time and effort it takes to move to production. If needed, the model can be fine-tuned and tweaked with transfer learning by incorporating business-specific data. This way, teams can rely on state-of-the-art performance aimed at their specific problem, while training takes less time and money and yields a production-ready model faster. This efficiency helps save on high compute costs, in terms of both dollars and energy.
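
As a rough sketch of what that transfer learning can look like with the Transformers Trainer API - here the public IMDB dataset stands in for your business-specific data, and all hyperparameters are illustrative:

    # pip install transformers datasets torch
    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)

    # Stand-in for business-specific labeled data
    dataset = load_dataset("imdb")
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, padding="max_length")

    dataset = dataset.map(tokenize, batched=True)

    # Start from a pre-trained checkpoint instead of training from scratch
    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=2)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="outputs", num_train_epochs=1),
        train_dataset=dataset["train"].shuffle(seed=42).select(range(1000)),
    )
    trainer.train()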

Hugging Face currently hosts more than 80,000 models and more than 11,000 datasets. It is used by more than 10,000 organizations, including the world’s tech giants, like Google, NVIDIA and Microsoft.

Now let’s see how Hugging Face models can be brought to production.

Deploying Your Hugging Face Model to Production at Scale with MLRun

Accelerating the deployment of your Hugging Face model in production is essential for automating and scaling your MLOps workflow. Efficient deployment to production can be done with MLRun. MLRun is an open source solution contributed and maintained by Iguazio for automating MLOps workflows.

MLRun takes a production-first approach to building and managing ML applications, and accelerates model deployment by relieving data scientists and engineers from working in siloed environments. Without MLRun, data scientists write code and then hand it to ML engineers, who convert it into containers and elastic services and add security, logging and so on. Then they build workflows from individual functions that process the data, train models, and test, deploy and monitor them. These actions take place manually and separately. With MLRun's production-first approach, data is collected and processed, and models are trained, in a production-ready environment, so they can be used in complex production pipelines with no need to transfer files and code between teams.

MLRun comprises four main components:

  • The Feature Store - for feature engineering
  • The ML CI/CD pipeline - for automating the pipeline stages: testing, training, deployment etc.
  • The real-time serving pipeline - for deploying models and data processing at scale with real-time serverless technology
  • Real-time monitoring and retraining - for monitoring data, models and production components and providing a feedback loop for identifying drift, retraining, and more.

MLRun is made up of a client and a server. The server runs on Kubernetes, whether in any cloud environment, on-premises or on virtual machines. The client side is the user's choice: it can be a Jupyter notebook or any other notebook, VS Code, SageMaker, Azure ML, PyCharm, etc.

The client side is used for building functions, testing them locally and launching them on the MLRun distributed cluster, which manages operational aspects like tracking and auto-scaling. In other words, once the model is developed in your IDE, deploying and managing it anywhere is just one API call away.

With MLRun, the entire flow is automated. The data scientist only needs to write the code in his or her IDE. Then, with a single API call or click, the code is converted into a fully elastic and scalable service that can run automatically in a real-time or batch pipeline. Data is automatically fed into the system, while MLRun enables triggering actions like pre-deployment, retraining, model tuning, and more.
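
To illustrate that single call, here is a hedged sketch of wrapping a local script as an MLRun function and running it locally or on the cluster. The file name trainer.py, its train handler and the epochs parameter are placeholders, not part of any specific demo:

    import mlrun

    # Wrap a local Python file as an MLRun function (names are placeholders)
    fn = mlrun.code_to_function("trainer", filename="trainer.py",
                                kind="job", image="mlrun/mlrun",
                                handler="train")

    fn.run(params={"epochs": 1}, local=True)  # test in the local environment
    fn.run(params={"epochs": 1})              # same call, runs on the cluster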

MLRun can also be used for complex scenarios, such as financial services or NLP use cases that require a complete and complex application pipeline with multiple models. Hugging Face is built into MLRun for both serving and training, and it can also be used to push models back to Hugging Face to share with the community.

To learn more about MLRun, check out the extensive documentation, tutorials and videos, here.


Using MLRun with Hugging Face

How can Hugging Face models be deployed to production with MLRun? Let's see how to build a serving pipeline with the model, and then how to retrain or calibrate the model with a training flow that grabs the data, processes it, optimizes the model and redeploys it. You can watch the entire demo here.

Workflow #1: Building a Serving Pipeline

Step 1: Create a project.
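
In code, this is a one-liner; the project name and context directory below are placeholders:

    import mlrun

    project = mlrun.get_or_create_project("huggingface-demo", context="./")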

Step 2: Add a serving function with the serving steps you need. A simple serving function might include intercepting a message, pre-processing, sentiment analysis with the Hugging Face model and post-processing. But you can add more steps, branch out, etc.

As mentioned, Hugging Face is built into MLRun for both serving and training, so no additional building work is required on your end except for specifying the models you want to use.
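
A sketch of what such a serving graph can look like. The step names, the serving.py file and the SentimentAnalysis class are illustrative stand-ins for your own code, not part of the MLRun API:

    # Register a serving function and describe its graph of steps
    serving_fn = project.set_function("serving.py", name="serving",
                                      kind="serving", image="mlrun/mlrun")
    graph = serving_fn.set_topology("flow", engine="async")
    (graph.to(handler="preprocess", name="preprocess")
          .to(class_name="SentimentAnalysis", name="sentiment-analysis")
          .to(handler="postprocess", name="postprocess")
          .respond())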

Step 3: Simulate the application locally. MLRun builds a simulator around the serving function.

Step 4: Test the model. Push requests into the pipeline to verify it’s working. Debug if you need to.
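
Steps 3 and 4 can look like this, assuming the serving function sketched above; the request body schema is illustrative:

    # Simulate the pipeline in-process, without deploying anything
    server = serving_fn.to_mock_server()

    # Push a test request through the pipeline and inspect the result
    resp = server.test(path="/", body={"text": "MLRun made this easy!"})
    print(resp)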

Step 5: Deploy. Turn the model into a real-world endpoint that serves requests and can be upgraded. The only action required is running a simple command. In the backend, MLRun builds the containers, pushes them to a registry and serves the entire pipeline. You now have a functioning, elastic, auto-scaling service.
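
A sketch of the deployment itself, again assuming the serving function defined earlier:

    # One call: MLRun builds the image, deploys the real-time function
    # and exposes an HTTP endpoint
    serving_fn.deploy()

    # Invoke the live endpoint (body schema is illustrative)
    serving_fn.invoke(path="/", body={"text": "MLRun made this easy!"})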

Workflow #2: Building a Training Pipeline

Step 1: Create a project.

Step 2: Register the training functions. The training function includes the training methods, the evaluation and any other required information.
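
Registration can be a single call per function; the file names, function names and handlers below are placeholders:

    project.set_function("data_prep.py", name="data-prep", kind="job",
                         image="mlrun/mlrun", handler="prepare")
    project.set_function("trainer.py", name="trainer", kind="job",
                         image="mlrun/mlrun", handler="train")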

Step 3: Set the workflow. The workflow includes the various training steps: preparing the dataset, training the model on the outputs of the data preparation, optimizing the model and deploying the function. The model can be deployed to any environment - production, development, staging, etc. - or to several at the same time. These workflows can also be triggered automatically by CI systems.
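
A minimal sketch of such a workflow file, wired to the placeholder functions registered above (all names, inputs and outputs are illustrative):

    # workflow.py
    from kfp import dsl
    import mlrun

    @dsl.pipeline(name="sentiment-training")
    def kfpipeline(dataset: str):
        # Prepare the dataset
        prep = mlrun.run_function("data-prep", inputs={"dataset": dataset},
                                  outputs=["train_set"])
        # Train the model on the prepared data
        train = mlrun.run_function(
            "trainer",
            inputs={"train_set": prep.outputs["train_set"]},
            outputs=["model"])
        # Redeploy the serving function with the newly trained model
        mlrun.deploy_function(
            "serving",
            models=[{"key": "sentiment", "model_path": train.outputs["model"]}])

The workflow is then registered on the project, for example with project.set_workflow("main", "workflow.py").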

Step 4: Run the pipeline. You can see the execution in MLRun’s UI. Since MLRun supports Hugging Face, the training artifacts are saved and can be used for comparisons, experiment tracking, and more.
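
Running the registered workflow is again a single call; the dataset argument is a placeholder that feeds the pipeline parameter above:

    run_id = project.run("main", arguments={"dataset": "<dataset-url>"},
                         watch=True)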

Step 5: Test the pipeline. Verify that the predictions have changed following the model training.

Step 6: Deploy.

That’s it! Using Hugging Face and MLRun together significantly shortens the model development, training, testing, deployment and monitoring process. By getting your models to production faster, you can answer business needs faster while saving resources.

To learn more about MLRun and Hugging Face and automating your workflows, watch the entire video.