Deploying Gen AI in Production with NVIDIA NIM & MLRun

Guy Lecker | June 9, 2025

In less than three years, gen AI has become a staple technology in the business world. In November 2022, OpenAI launched ChatGPT, which grew to over 1 million users in just five days, galvanizing the widespread use of gen AI. Over the course of 2023, enterprises entered the experimentation stage and kicked off POCs with API services and open models including Llama 2, Mistral, NVIDIA and others. In 2024, organizations began setting aside dedicated budgets for gen AI while ramping up their efforts to build accelerated infrastructure to support gen AI in production.

In this blog post, we spotlight a leading player in the gen AI infrastructure ecosystem, NVIDIA, commonly known for their GPUs, software and research, which have helped drive gen AI implementation and adoption. We introduce their model deployment solution, NVIDIA NIM. Then, we show how to use NVIDIA NIM with MLRun to productize gen AI applications at scale and reduce risks, including a demo of a multi-agent banking chatbot.

This blog is based on the webinar Deploying Gen AI in Production with NVIDIA NIM & MLRun, with Amit Bleiweiss, Senior Data Scientist at NVIDIA; Yaron Haviv, co-founder and CTO at Iguazio (acquired by McKinsey); and Guy Lecker, ML Engineering Team Lead at Iguazio. You can watch the entire webinar here.

Managed Gen AI Services vs. Do-It-Yourself

Organizations often have to choose between managed services, which provide ease of use, and the control that comes with a do-it-yourself deployment.

Gen AI managed services are often chosen because they provide easy-to-use APIs for development and infrastructure, creating a faster path for getting started with AI.

The downside of managed services is that they require organizations to share data and prompts externally with the provider, and the enterprise usually has limited control, impacting the overall organizational gen AI strategy. In addition, since payment is per inference, costs can easily spike.

On the other hand, a do-it-yourself deployment provides the ability to run anywhere across the data center or the cloud and the ability to securely manage data in a self-hosted environment.

The tradeoff of these benefits is that the organization must handle infrastructure optimization, custom code for APIs and fine-tuned models, and ongoing maintenance and updates.

NVIDIA NIM bridges these two approaches.

What is NVIDIA NIM?

NVIDIA NIM (NVIDIA Inference Microservices) is a suite of optimized, containerized microservices for deploying generative AI models across various infrastructures: cloud, on-prem, hybrid, etc. NIM facilitates the creation of tools like chatbots, AI assistants and other generative AI applications across cloud environments, data centers and workstations.

NIM offers developers and enterprises a standardized approach for integrating AI capabilities into applications.

NVIDIA NIM includes:

  • Prebuilt container and Helm chart, supporting Docker deployments
  • Industry-standard APIs (see the example after this list)
  • Support for custom models
  • Domain-specific code
  • Optimized inference engines for GPUs
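
Because NIM exposes industry-standard, OpenAI-compatible APIs, any standard client can talk to a deployed NIM. Below is a minimal sketch assuming a NIM container is already running locally on port 8000; the base URL and model name are illustrative and should be adjusted to your deployment.

    # Minimal sketch: calling a locally deployed NIM through its OpenAI-compatible API.
    # The base_url, port and model name are assumptions for a local Llama NIM container.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",  # NIM serves an OpenAI-compatible endpoint
        api_key="not-used",                   # a local NIM container ignores the key
    )

    response = client.chat.completions.create(
        model="meta/llama3-8b-instruct",      # check available models with client.models.list()
        messages=[{"role": "user", "content": "What does NVIDIA NIM provide?"}],
        max_tokens=128,
    )
    print(response.choices[0].message.content)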

Model optimization methods include:

  • KV caching - Saves GPU memory by reusing the keys and values already computed for previously generated tokens (sketched below)
  • Parallelization strategies - Let you choose between latency-sensitive and throughput-sensitive runtime configurations
  • In-flight batching - Makes use of batches of different sizes
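
To make the first of these concrete, here is a toy, framework-free illustration of KV caching: keys and values of earlier tokens are stored once, so each decoding step only computes attention inputs for the newest token. This is a conceptual sketch only; NIM's inference engines implement this (and much more) in optimized GPU kernels.

    # Toy sketch of KV caching in autoregressive decoding (illustration only).
    import numpy as np

    d = 8                                     # toy hidden/head dimension
    Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
    k_cache, v_cache = [], []                 # grow by one entry per generated token

    def decode_step(x_new):
        """Attend the newest token over all cached keys/values."""
        k_cache.append(x_new @ Wk)            # K/V computed only for the new token
        v_cache.append(x_new @ Wv)
        q = x_new @ Wq
        K, V = np.stack(k_cache), np.stack(v_cache)
        scores = K @ q / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()              # softmax over all past tokens
        return weights @ V                    # attention output for the new token

    for _ in range(5):                        # pretend we generate five tokens
        out = decode_step(np.random.randn(d))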

NIMs are not limited to text. They span region-optimized text, visual, RAG, speech, digital human, healthcare, computer vision and simulation models.

NVIDIA NIM enables data professionals to:

  • Deploy anywhere with security and control of AI applications and data.
  • Speed time to market with prebuilt, continuously maintained microservices.
  • Use the latest AI models, standard APIs and enterprise tools.
  • Optimize throughput and latency to maximize token generation and responsiveness.
  • Boost accuracy by tuning custom models from proprietary data sources.
  • Deploy in production with API stability, security patching, quality assurance and enterprise support.

For a deeper dive into how NVIDIA NIM works, including the pull sequence, how GPU engines are optimized to achieve 2x or 3x improved throughput, agent blueprints and an example of building a RAG application, watch the webinar.

What Does It Take to Productize Gen AI Applications?

Before delving into how NVIDIA NIM works with MLRun, it's worth first understanding what organizations need to productize their gen AI applications.

Development is just the first step of the gen AI lifecycle. It typically includes:

  1. Exploring relevant data assets.
  2. Building data ingestion and transformation pipelines.
  3. Developing data enrichment and RAG logic (see the sketch after this list).
  4. Developing the gen AI application and agent workflow.
  5. Developing the front end application.
  6. Developing the fine-tuning and RLHF workflow (This is optional and depends on your business needs).
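
As a rough illustration of step 3, the sketch below retrieves the most relevant documents for a question and builds an enriched prompt. It uses TF-IDF purely for simplicity; a production pipeline would typically use embeddings and a vector database such as Milvus, and the documents here are invented examples.

    # Toy sketch of data enrichment / RAG logic (step 3), using TF-IDF retrieval.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    documents = [
        "Our standard mortgage requires a 20% down payment.",
        "Business loans for new restaurants are capped at $500,000.",
        "Savings accounts earn 3% annual interest.",
    ]

    vectorizer = TfidfVectorizer().fit(documents)
    doc_vectors = vectorizer.transform(documents)

    def retrieve(question: str, k: int = 2) -> list[str]:
        """Return the k documents most similar to the question."""
        scores = cosine_similarity(vectorizer.transform([question]), doc_vectors)[0]
        return [documents[i] for i in scores.argsort()[::-1][:k]]

    def build_prompt(question: str) -> str:
        context = "\n".join(retrieve(question))
        return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

    print(build_prompt("How much can I borrow to open a restaurant?"))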

Development alone takes a few weeks. Then, taking an application to production requires much more time and effort:

For data management, security and governance:

  • Automating, scaling, versioning and productizing data pipelines.
  • Ensuring data security, lineage and risk controls.
  • Adding application security (authentication, RBAC, auditing).
  • Adding real-time guardrails and hallucination protection.

For quality, scalability and continuous delivery:

  • Implementing modularity with LLM, data and API abstractions.
  • Implementing tests for models, prompts, application logic, etc. (see the test sketch after this list).
  • Optimizing performance, costs and supporting workload elasticity.
  • Adding observability, logging and experiment tracking.
  • Building containers, microservices and cloud resource integrations.
  • Developing automated CI/CD pipelines (for data, models and apps).
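
As a small example of the testing bullet above, the sketch below shows the kind of unit test that can guard intent-classification logic. The classify_intent helper is a hypothetical stand-in for the real LLM-backed classifier, which would normally be mocked or pointed at a test endpoint in CI.

    # Minimal pytest sketch for prompt/application-logic testing (names are hypothetical).
    import pytest

    ALLOWED_INTENTS = {"loans", "investments", "general"}

    def classify_intent(question: str) -> str:
        # Placeholder standing in for the real LLM-backed intent classifier.
        return "loans" if "loan" in question.lower() else "general"

    @pytest.mark.parametrize("question,expected", [
        ("Can I get a loan to open a restaurant?", "loans"),
        ("What are your branch opening hours?", "general"),
    ])
    def test_intent_classification(question, expected):
        intent = classify_intent(question)
        assert intent in ALLOWED_INTENTS      # output is always a valid route
        assert intent == expected             # and routes to the expected agent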

Finally, Live Operations:

  • Automating deployment processes and rollbacks.
  • Supporting health checks, recoverability and disaster recovery.
  • Managing resources, implementing FinOps and chargebacks.
  • Monitoring application performance, accuracy, drift, risk, etc.
  • Monitoring business KPIs and creating custom dashboards.

This can take months, even with a large team of engineers.

When planning this process, it’s important to note that many of these components can be reused across projects, and don’t require redevelopment. This means the same infrastructure can be used for multiple products, supporting scale and resource efficiency.

What is a Gen AI Factory?

A gen AI factory allows developers and users to quickly demo, build, deploy, and scale new gen AI applications, accessible through a portal.

The factory contains four pipelines:

  • Data Management
  • Development
  • Deployment
  • LiveOps

In addition, governance for de-risking the applications is required throughout.

What are the Use Cases for Gen AI Factories?

The versatility and efficiency of a gen AI factory mean it can be leveraged across industries and organization types. In general, a gen AI factory becomes necessary:

1. When developers and data scientists need a gen AI app/tech playground.

2. When you need to select from a variety of pre-built gen AI apps and demo/customize them.

3. When you want to build your own production-grade gen AI apps.

Key gen AI factory attributes:

  • Pre-provisioned services to minimize time and hassle.
  • Support for the full gen AI lifecycle: data, model tuning/eval, app build, LiveOps, and more.
  • Support for hosted open-source/custom LLMs or external LLM services.
  • Pre-loaded with components, reusable recipes, wizards, and docs.
  • Dynamic resources (containers, GPUs, etc.) to enable scaling and reduce costs.
  • Integration with NVIDIA NIM & NeMo.
  • Auth integration and charge-back.

The Iguazio gen AI factory comes pre-installed with:

  • MLRun, Kubeflow Pipelines
  • MPI/Horovod, Ray
  • Prometheus, Grafana
  • Jupyter/JupyterHub
  • DBs: MySQL/PostgreSQL, Redis
  • Milvus
  • Analytics: Spark, Kafka, Presto
  • Ingress, LLM Gateway + security

On-demand containers:

  • User Jobs/workflows
  • Nuclio serverless functions
  • Front-end applications
  • User pods/deployments
  • Auxiliary services

Iguazio runs on-prem, on any cloud or hybrid, and offers easy portability. Each user or application can access a personalized sandbox of microservices. This environment allows users to select components that fit their needs, such as Spark or Kafka, depending on their data processing and engineering requirements. Telemetry tools track resource usage (e.g., GPUs, CPUs) per user or project, enabling chargebacks and troubleshooting. This means developers can start building applications across the entire gen AI project lifecycle without the infrastructure hassle.

Now let’s see how this fits in with NVIDIA NIM.

Building, Customizing and Deploying Gen AI Applications with NVIDIA and MLRun

NIM provides a pre-baked container with multiple capabilities: the LLM itself, micro-batching, performance optimizations, telemetry and more. When you need to deploy that container on a cluster, or multiple containers across multiple GPUs, and elastically auto-scale them, you need a full pipeline covering data preparation, fine-tuning, storing models in a registry and more.

This requires orchestration, which MLRun provides at the core of the Iguazio AI platform. MLRun is an open-source AI orchestration framework for managing ML and generative AI applications across their lifecycle. It automates data preparation, model tuning, customization, validation and optimization of ML models, LLMs and live AI applications over elastic resources.

The result is a production-worthy solution, combining cutting-edge technology with optimized performance and orchestration capabilities.

Let’s see how they both work together. 

Demo: Deploying Gen AI in Production with NVIDIA and MLRun

This demo walks through using MLRun and NIM to create a multi-agent banking chatbot. The process follows MLOps best practices, based on a production-ready mindset. You can follow this demo’s notebook here.

Key steps:

1. Project Setup and Model Deployment - An MLRun project is created, leveraging its serverless function capability to deploy and auto-scale an LLM-based model. In this demo, Llama 8B is selected for the chatbot.
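
A minimal sketch of this step is shown below, assuming MLRun is installed and configured against a cluster. The project name, file name and image are hypothetical placeholders; the actual notebook wires the NIM-served Llama model into MLRun's serving graph.

    # Sketch of demo step 1: create an MLRun project and deploy a serving function.
    import mlrun

    # Create (or load) the project that owns the chatbot's functions and artifacts.
    project = mlrun.get_or_create_project("banking-chatbot", context="./")

    # Register the chatbot code as a serverless serving function that MLRun can
    # build, deploy and auto-scale on the cluster.
    project.set_function(
        func="chatbot_serving.py",   # hypothetical module containing the serving graph
        name="banking-chatbot",
        kind="serving",
        image="mlrun/mlrun",
    )

    # Deploy the function; MLRun returns an HTTP endpoint for the chatbot.
    project.deploy_function("banking-chatbot")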

2. Chatbot Design and Functionality - The chatbot incorporates intent classification to route queries to the appropriate agents within the banking chatbot: loans, investments and general inquiries. Then, using NVIDIA's LLM via LangChain, a prompt template is crafted to test the intent classification, with a scenario where a loan request is classified correctly. In the scenario, $250,000 was requested to open a restaurant.
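
The sketch below shows what such an intent-classification prompt could look like with LangChain against a locally deployed NIM. The package, model name and base_url are assumptions for a self-hosted NIM endpoint; the demo's exact prompt and wiring differ.

    # Sketch of demo step 2: intent classification over a NIM-hosted model via LangChain.
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_nvidia_ai_endpoints import ChatNVIDIA

    llm = ChatNVIDIA(
        model="meta/llama3-8b-instruct",      # assumed NIM model name
        base_url="http://localhost:8000/v1",  # assumed local NIM endpoint
    )

    prompt = ChatPromptTemplate.from_messages([
        ("system",
         "Classify the user's banking question as one of: loans, investments, general. "
         "Answer with the single label only."),
        ("user", "{question}"),
    ])

    classifier = prompt | llm
    result = classifier.invoke(
        {"question": "I'd like to borrow $250,000 to open a restaurant."}
    )
    print(result.content)                     # expected label: loans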

3. Operationalizing the Model with MLOps Tools - MLRun's LLM gateway is used, providing model modularity and the flexibility to switch or tune models (e.g., OpenAI or Cohere) on demand. It also supports monitoring multiple use cases for the same LLM: MLRun lets each use case be monitored individually for accuracy, which is essential in MLOps.

4. Monitoring and Evaluation - An LLM-as-a-Judge application is deployed to monitor and evaluate model performance, using OpenAI as the judge model for scoring predictions. In testing, the judge closely matched the model’s performance, verifying effective monitoring.
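
A rough sketch of the idea is shown below: an external judge model (here OpenAI, as in the webinar) grades the chatbot's answers so quality can be tracked over time. The prompt, scoring scale and judge model name are illustrative assumptions rather than the demo's exact configuration.

    # Sketch of LLM-as-a-Judge scoring with OpenAI as the judge model.
    from openai import OpenAI

    judge = OpenAI()  # reads OPENAI_API_KEY from the environment

    def judge_answer(question: str, answer: str) -> int:
        """Ask the judge model to grade an answer from 1 (poor) to 5 (excellent)."""
        prompt = (
            "You are grading a banking chatbot.\n"
            f"Question: {question}\nAnswer: {answer}\n"
            "Reply with a single integer from 1 to 5."
        )
        result = judge.chat.completions.create(
            model="gpt-4o-mini",              # assumed judge model
            messages=[{"role": "user", "content": prompt}],
            max_tokens=3,
        )
        return int(result.choices[0].message.content.strip())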

5. Final Workflow Setup - The intent classification setup is wrapped in a reusable Chain Runner class, allowing configuration for different applications beyond banking. The complete workflow integrates session tracking, refined query handling, agent choices, and history-saving components.

6. Chatbot Interaction - The chatbot successfully handles various queries, accurately switching between loan and investment agents based on context (e.g., house purchase inquiries). Demonstrations include responses about mortgage options and other investment types.

  • The chatbot accurately differentiates between loan and investment inquiries. When asked about buying a house, it classifies this as an investment question, providing relevant insights like location and market research factors.
  • Later, when specifically asked about loans for buying a house, it suggests a mortgage as the most common option.
  • When prompted about other investment options, the chatbot offers suggestions such as stocks and bonds, showcasing its multi-agent capabilities in handling diverse financial topics.

See the demo for yourself and watch the entire webinar here.