
Introduction to TF Serving

Guy Lecker | November 21, 2021

Machine learning (ML) model serving refers to the series of steps that allow you to create a service out of a trained model that a system can then ping to receive a relevant prediction output for an end user. 

These steps typically involve required pre-processing of the input, a prediction request to the model, and relevant post-processing of the model output to apply business logic. 

Out of the vast spectrum of tools and techniques that enable model serving, we will focus on TensorFlow Serving in this blog post.

Figure 1: The high-level steps of ML model serving (Source: Kubernetes Blog)

Still, serving machine learning models is not a solo act. For a full-scale production system, other external processes perform a vital support role to model serving: services that fetch and ensure consistency of features between serving and training time, services that monitor and log requests, services that automatically deploy the latest model(s), to name a few.

If you are solely interested in technical guidance on TF Serving and TensorFlow production deployment, you can jump to the What TensorFlow Serving Can Do for You section for a hands-on guide tailored to different levels of experience.

However, if you are also curious about how model serving fits into the intertwined scene of MLOps and how to move toward MLOps maturity, you’ll be interested in what’s coming next.

The Need for MLOps and the Central Role of Model Serving

The history of machine learning has been marked by a few major game-changing events, with the 2012 deep learning revolution being one of the first to come to mind.

2020 was no different, with a new transformational topic becoming the top priority for all big companies and at conferences of all sizes: MLOps.

For many years, the focus of data science teams had been on collecting data and iterating through increasingly complex modeling techniques, with the aim of proving the value of investing in ML.
Once the offline value had been proven, data scientists and engineering teams found themselves with a new challenge: serving the models in production to prove their online value.

We didn’t know it then, but at that moment, the need for MLOps was born.

Figure 2: The end-to-end Machine Learning lifecycle (Source: Google Cloud)

MLOps, i.e., DevOps for machine learning, is a discipline that aims to orchestrate and maintain the end-to-end machine learning lifecycle.

This lifecycle encompasses everything from defining the experimentation framework and automating training pipelines to CI/CD with feature drift monitoring and canary deployment; it also covers how engineering and science teams coordinate, as well as model explainability.

While some of these terms exist in standard DevOps practices, others are entirely new, since with MLOps you not only need to put code into production but two additional artifacts as well: data and models.

In the even-greater spectrum of tools and techniques that have emerged to tackle these unique MLOps needs, one pitfall teams should try to avoid is settling on a hybrid and disjointed catalog of tools and techniques that are hard to connect and maintain. In machine learning, more than in any other software discipline, the interplay between experimentation/development and production pipelines is circular and demands a unified MLOps infrastructure. 

The only way to reach MLOps maturity is to keep it simple and manageable within a well-defined set of best practices so your teams can focus on inter-communication and business value. 

Tools such as Iguazio and Kubeflow with TFX are examples of unified MLOps frameworks that you may want to explore to reach your full MLOps potential.

What TensorFlow Serving Can Do for You

TensorFlow Serving is one of the most classic serving systems for ML and has evolved over the years to cover an ever-wider range of production requirements.

The longevity and broad adoption of this tool are worth highlighting: they have yielded a wealth of online support and inspiration for new and experienced users alike, which can make a real difference for developers.

Another important aspect to mention before diving into the technicalities is that TF Serving is for online serving—not batch serving. While the latter is performed offline with multiple data entries (e.g., scoring the probability that your customer base will respond positively to an email campaign), the former is performed online with single requests sent live (e.g., scoring an online transaction and stopping it if it is fraudulent).

You can typically perform batch serving using the same pipeline code and logic of training. With a bit of a simplification, you can just use a .predict() instead of a .fit() method. 
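To make that concrete, here is a minimal batch-scoring sketch in Python. The model path, CSV file, and output handling are placeholders standing in for your own pipeline; the point is simply that the training-time model code is reused with .predict().

import pandas as pd
import tensorflow as tf

# Hypothetical trained Keras model and pre-processed feature file
model = tf.keras.models.load_model("./my_model")             # same model code as training
features = pd.read_csv("customer_features.csv").to_numpy()   # a whole batch of inputs at once
scores = model.predict(features, batch_size=1024)            # .predict() instead of .fit()
pd.DataFrame({"score": scores.ravel()}).to_csv("scores.csv", index=False)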

Online serving, on the other hand, requires that you create an endpoint that can be pinged in real time within an often stringent latency budget and high availability requirements. This endpoint also needs to be hosted on a server outside the TF Serving offering. 

Which server to use typically boils down to one question: managed or unmanaged? While it is typically recommended to go with managed tools such as AWS SageMaker or Iguazio, there are some reasons to prefer unmanaged solutions such as TensorFlow Serving on Kubernetes or virtual machines.

When SLAs are very strict or very specific settings are required, only unmanaged solutions give you the customization needed for optimization. Hence the need for, and the beauty of, ML tools such as TF Serving.

With this general understanding of where to host your server in mind, we can now discuss how to create the model server itself with TensorFlow Serving. 

How Does TensorFlow Serving Work?

TensorFlow Serving has a flexible, high-performance architecture that allows you to perform real-time model serving at scale and with varying degrees of customization.

Figure 3: The main components of TensorFlow Serving (Source: TensorFlow)

This architecture is composed of a few abstractions, as seen in Figure 3, each implementing its own API, highlighted in bold below.

The first abstraction is the Source component. It identifies servables, or new versions of servables, on the filesystem (S3, local, custom, and more) by watching a specific file path. Any model deployment needs one SavedModelBundle (or more if it is a composite model) and any relevant lookup tables for embeddings or vocabulary.

When a new model is identified, a Loader is created for that version to collect metadata, estimate system resources such as RAM, and output an Aspired Version. Once the DynamicManager receives the Aspired Version, it either creates a new deployment with the specified parameters if resources are available or requests an update for the existing model following the Version Policy. 

The Version Policy can either preserve availability, i.e., have the two models coexist until the new one is fully loaded to guarantee continuity of service, or preserve resources, i.e., retire the old model first and only then load the new one when resources are limited.

When the client sends a request, it will seamlessly receive the ServableHandle from the Manager with the requested model version loaded. The sequence of versions of a servable, ordered by increasing version number, is called the servable stream. Finally, the Core manages the above lifecycle and metrics.
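Once you have a server running (we set one up in the guide below), you can observe this version lifecycle through the REST model status endpoint, which reports the state of every loaded version. Here is a small sketch; the model name and port are placeholders.

import requests

# Assumes a TF Serving instance running locally with REST on port 8501
# and serving a model named "my_model" (placeholder name).
status = requests.get("http://localhost:8501/v1/models/my_model").json()
print(status["model_version_status"])  # version number, state (e.g., AVAILABLE), and status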

Now that you’ve had an overview of TF Serving’s powerful and user-friendly model journey, we can dive into the details of how to set up your model serving.

As previously mentioned, the framework has varying levels of customization that can be mapped to increasing levels of complexity in deployment. For this reason, we will arrange the how-to guide into Beginner, Intermediate, and Expert sections that build on top of each other to address increasingly complex functionalities.

Note: All sections assume a good base knowledge of TensorFlow.

A Beginner’s How-To Guide to Serving Models with TF Serving

Installing the Required Libraries

From an Ubuntu machine, run: 

apt-get update && apt-get install tensorflow-model-server
pip install tensorflow==2.6.0
pip install tensorflow-serving-api==2.6.0  # match distribution of tensorflow
pip install requests==2.26.0  # for prediction requests via Python

Note: If you are using an Ubuntu virtual machine, you’ll need to provide the TF Serving distribution URI as a package source first by running:

echo "deb [arch=amd64] http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" | sudo tee /etc/apt/sources.list.d/tensorflow-serving.list && \

curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg | sudo apt-key add -

This installation flow should typically work. For any unexpected requirements, please refer to the official TensorFlow installation guide and make sure you are working in a clean Python 3 virtual environment.

Creating the Model Servables

While TF Serving can also support frameworks other than TensorFlow, it is highly recommended to use TensorFlow for training your models when deploying with TF Serving.

After model training, TF Serving requires you to export the model in the SavedModel format. This TF-specific serialization format stores the model’s graph in a protocol buffer file (saved_model.pb) alongside its weights in variables/ and any auxiliary files in assets/ and assets.extra/.

For a Keras model, run: tf.keras.models.save_model(trained_model, export_path)

For other models, run: tf.saved_model.save(trained_model, export_path)

Here, trained_model is the model variable, and export_path is the file path to where you will save the model. It’s as simple as this. 
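For instance, here is a minimal, self-contained sketch of exporting a toy Keras model into the versioned directory layout TF Serving watches. The base path and version number are placeholders; the numeric version subdirectory is what TF Serving expects under the model base path.

import os
import tensorflow as tf

# Stand-in for your own trained model.
trained_model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(3,))])

MODEL_BASE_PATH = "/tmp/my_model"  # placeholder base path that TF Serving will watch
MODEL_VERSION = 1                  # TF Serving expects numeric version subdirectories

export_path = os.path.join(MODEL_BASE_PATH, str(MODEL_VERSION))
tf.keras.models.save_model(trained_model, export_path)
# export_path now contains saved_model.pb plus the variables/ and assets/ directories.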

Note: The above functionalities belong to Tensorflow 2.x. For TensorFlow 1.x models, which we recommend migrating from, you need to use SavedModelBuilder with SignatureDefs to export the model. Just use the following code:

# Note: this is graph-mode TF 1.x code; under TF 2.x, call
# tf.compat.v1.disable_eager_execution() before building the graph.
sess = tf.compat.v1.InteractiveSession()

tensor_info_x = tf.compat.v1.saved_model.utils.build_tensor_info(model_input)
tensor_info_y = tf.compat.v1.saved_model.utils.build_tensor_info(model_output)

prediction_signature = tf.compat.v1.saved_model.build_signature_def(
    inputs={input_name: tensor_info_x},
    outputs={output_name: tensor_info_y},
    method_name=tf.saved_model.PREDICT_METHOD_NAME
)

builder = tf.compat.v1.saved_model.builder.SavedModelBuilder(export_path)
builder.add_meta_graph_and_variables(
    sess,
    [tf.saved_model.SERVING],
    signature_def_map={
        tf.saved_model.DEFAULT_SERVING_SIGNATURE_DEF_KEY: prediction_signature
    }
)
builder.save()

Here, model_input and model_output are the input and output tensors of your model, while input_name (e.g., ‘example’) and output_name (e.g., ‘predictions’) are the names by which the model server expects input and output data to be addressed.

In both cases:

  • You should append a numeric version subdirectory to the export_path (e.g., export_path/1/): TF Serving uses it to identify and track model versions.
  • You can explore the saved SignatureDef(s) by running saved_model_cli show --dir $export_path --all.

Serving the Model

Given the model is accessible in a specified model_base_path folder, tensorflow-model-server will create a local server for you that can be accessed with the specified model_name via a gRPC request at the specified port (8500 by default) or via a REST API at the specified rest_api_port (anything different from the port). 

So, input the following code:

tensorflow_model_server \
  --port=8500 \
  --rest_api_port=8501 \
  --model_name=$model_name \
  --model_base_path=$model_path

Here, $model_name and $model_path should be replaced by the relevant values.

For more information on available flags, please review the built-in documentation by running tensorflow_model_server --help.

Sending a Request to Your Model

You can send requests to your model server in different ways, typically either via Python or cURL. (There’s also a server API in C++.)

The server listens for gRPC requests on --port (8500 above) and for REST requests on --rest_api_port (8501 above); gRPC is generally preferred because it is typically faster and more robust.
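For a gRPC request from Python, a minimal sketch using the tensorflow-serving-api package installed earlier might look like the following. The model name, signature input name, and tensor values are placeholders you should replace with your own (you can check them with saved_model_cli show).

import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "my_model"                   # placeholder model name
request.model_spec.signature_name = "serving_default"  # default Keras serving signature
request.inputs["inputs"].CopyFrom(                     # placeholder input name
    tf.make_tensor_proto([[1.0, 2.0, 3.0]], dtype=tf.float32))

result = stub.Predict(request, 10.0)  # 10-second deadline
print({name: tf.make_ndarray(proto) for name, proto in result.outputs.items()})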

For a REST request with cURL, run:

curl -X POST http://localhost:8501/v1/models/$model_name:predict -d '{"inputs": {"$input_name": $input_example}}'

For a REST request with Python, run:

import requests

API_ENDPOINT = "http://localhost:8501/v1/models/$model_name:predict"
data = {"inputs": {"$input_name": $input_example}}
response = requests.post(url=API_ENDPOINT, json=data)
print(response.json())

Here, $input_name is the input name specified in the prediction_signature_def, and $input_example is an input value that matches the features the model has been trained on.

Shutting Down Our Server

To shut the server down, run: 

ps aux | grep -i "tensorflow_model_server"
kill -9 PID

This will show you the PID of the tensorflow-model-server process so you can kill it. It’s not graceful, but it does the job.

An Intermediate How-To Guide to Serving Models with TF Serving

Containerizing the Model Application

When building any application, you should really be using Docker. So, assuming Docker is already correctly installed on your system, run the following code to create your server as a container in the background:

docker pull tensorflow/serving:2.6.0
docker run -t --rm -d -p 8500:8500 -p 8501:8501 -v "$model_path:/models/$model_name" -e MODEL_NAME=$model_name tensorflow/serving:2.6.0

You should replace $model_name and $model_path with the relevant values.

The container will run in detached mode (-d) with an allocated pseudo-tty (-t) and will be removed when it exits or when the daemon exits (--rm).

Note: If using GPUs can lead to better performance for your application, you may want to pull the tensorflow/serving:2.6.0-gpu image instead and make sure your NVIDIA driver and nvidia-docker installation work as expected.

You should then serve the model on Kubernetes, or a managed Kubernetes version, for scalability and resilience.

Defining the Right Model Server Configuration

The --model_name and --model_base_path flags allow you to serve a model quickly, but only one model at a time.

For more complex scenarios with multiple models where you need to configure options such as rollout strategies or polling frequency for new versions, you need to provide a ModelServer config file.

Your model’s config should contain something similar to:

model_config_list {
  config {
    name: 'my_first_model'
    base_path: '/tmp/my_first_model/'
    model_platform: 'tensorflow'
  }
  config {
    name: 'my_second_model'
    base_path: '/tmp/my_second_model/'
    model_platform: 'tensorflow'
  }
}

You can then provide the config file by adding the --model_config_file and --model_config_poll_wait_seconds flags to your TF Serving start-up command. The latter simply instructs the server to periodically check for an updated config file at the path specified by --model_config_file.

One important consideration is that, by default, the server will serve the latest version of each model. You can, however, define the model_version_policy field within a model’s config to change this. For example, to serve two versions and give them labels, you would add:

model_version_policy {
  specific {
    versions: 1
    versions: 2
  }
}
version_labels {
  key: 'stable'
  value: 1
}
version_labels {
  key: 'canary'
  value: 2
}

Then, the client would need to update their request to target either versions/<version number>, i.e., the value, or labels/<version label>, i.e., the key, for the model.
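Here is a small Python sketch of both options over the REST API; the model name, label, and input values are placeholders matching the example config above, and the server is assumed to be running locally with REST on port 8501.

import requests

payload = {"instances": [[1.0, 2.0, 3.0]]}  # example input matching the model's features

# Target an explicit version number...
r1 = requests.post("http://localhost:8501/v1/models/my_first_model/versions/2:predict", json=payload)
# ...or a version label (by default, labels can only be assigned to versions that are already loaded).
r2 = requests.post("http://localhost:8501/v1/models/my_first_model/labels/canary:predict", json=payload)
print(r1.json(), r2.json())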

An Expert How-To Guide on Serving Models With TF Serving

Batching Our Requests

Batching online requests together can be a good way to optimize throughput and unlock the potential of accelerators such as GPUs, which would otherwise sit underutilized while requests arrive one at a time.

TF Serving allows you to perform request batching by setting the --enable_batching and --batching_parameters_file flags, where the batching.config file may look like:

max_batch_size { value: 64 }
batch_timeout_micros { value: 0 }
max_enqueued_batches { value: 100000 }
num_batch_threads { value: 4 }

Refer to TF’s Batching Guide for more information on how to set up these parameters.

Note: The scheduling for batching is done globally for all models and versions to ensure optimal resource utilization.

Monitoring the Server

TensorFlow Serving gives you the capability to connect to Prometheus to monitor metrics from your model server. 

Simply provide the --monitoring_config_file flag to reference a MonitoringConfig, i.e., a protocol buffer file, which you can call "monitoring.config". It would look like:

prometheus_config {
  enable: true,
  path: "/monitoring/prometheus/metrics"
}

Next, make sure the model server’s deployment manifest passes the monitoring config so that your Prometheus server can scrape it:

containers:
  - image: tensorflow/serving:latest
    args:
      - --port=8500
      - --rest_api_port=8501
      - --model_config_file=/models/models.config
      - --monitoring_config_file=/etc/monitoring.config

You will also be able to check if the metrics are correctly exported at http://localhost:8501/monitoring/prometheus/metrics.
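A quick way to verify this from Python, assuming the server above is running locally with REST on port 8501:

import requests

metrics = requests.get("http://localhost:8501/monitoring/prometheus/metrics").text
print("\n".join(metrics.splitlines()[:10]))  # first few Prometheus-format metric lines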

Customizing TF Serving

TF Serving lets you customize almost every component described above, from defining your own Version Policy or Source to plugging in custom Loaders and batchers.

However, once you find yourself wanting to customize TF Serving, it could be time to look for something new.

What Comes After TF Serving?

Remember, model serving is only one part of the end-to-end machine learning lifecycle.

While TensorFlow Serving performs well as a standalone model server, other steps are usually required for putting models in production, such as data pre-/post-processing, lineage, monitoring, and automation.

You will need other tools to provide these capabilities to TensorFlow Serving. We recommend taking a look at open-source options like TensorFlow Extended and MLRun as a next step. The former is tailored to the TensorFlow ecosystem. The latter provides an end-to-end serverless MLOps orchestration framework, from development to deployment, that works anywhere, manages both data and models from any framework, and comes with a user-friendly interface as well. You can check out the quick start guide here.
