
What’s the best way to address scalability and performance challenges for a generative AI app?

Training or even tuning a model requires lots of computation. You might even need distributed frameworks.

There are two main distributed frameworks used in this space. The first is Horovod, which runs distributed training on top of TensorFlow or PyTorch while using the high-speed MPI layer for messaging underneath. The second is Ray, a distributed Python framework that lets you distribute the workload and shard your model, scaling across multiple machines based on either the data or the model itself.
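The core pattern these frameworks implement for data-parallel training can be sketched in plain Python. This is a hypothetical, single-process illustration (the function names and the toy linear model are mine, not Horovod's or Ray's API): each "worker" computes a gradient on its own data shard, the gradients are averaged in an all-reduce step, and every worker applies the same update.

```python
# Hypothetical sketch of data-parallel training, the pattern that
# frameworks like Horovod and Ray parallelize across real GPUs.

def local_gradient(weights, shard):
    # Toy gradient for a 1-D linear model y = w * x with squared loss.
    w = weights[0]
    grad = 0.0
    for x, y in shard:
        grad += 2 * (w * x - y) * x
    return [grad / len(shard)]

def all_reduce_mean(gradients):
    # Average corresponding gradient entries across workers
    # (the "all-reduce" step Horovod runs over MPI).
    n = len(gradients)
    return [sum(g[i] for g in gradients) / n for i in range(len(gradients[0]))]

def distributed_step(weights, shards, lr=0.02):
    # In a real framework each shard's gradient is computed in parallel.
    grads = [local_gradient(weights, shard) for shard in shards]
    mean_grad = all_reduce_mean(grads)
    return [w - lr * g for w, g in zip(weights, mean_grad)]

# Data sampled from y = 3x, split across two "workers".
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
weights = [0.0]
for _ in range(200):
    weights = distributed_step(weights, shards)
```

Because every worker sees the same averaged gradient, all replicas stay in sync, which is exactly why the all-reduce step is the communication bottleneck these frameworks optimize.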

There are also frameworks that address serving at scale. Serving an LLM like GPT-3 requires a significant number of GPUs, say anywhere between four and eight, which translates to a budget of roughly $20,000 - $30,000 per month. Smaller models like OpenLLaMA can also perform well if you fine-tune them with enough domain knowledge, and they require only one or two GPUs.
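The budget figures above are easy to sanity-check with back-of-envelope arithmetic. The per-GPU-hour price below is an assumption for illustration (roughly what an A100-class cloud instance costs), not a quote:

```python
# Back-of-envelope monthly serving cost, assuming a hypothetical
# ~$4 per GPU-hour for an A100-class cloud instance running 24/7.
HOURS_PER_MONTH = 730  # ~(24 * 365) / 12

def monthly_cost(num_gpus, usd_per_gpu_hour=4.0):
    return num_gpus * usd_per_gpu_hour * HOURS_PER_MONTH

large = monthly_cost(8)  # GPT-3-class model on 8 GPUs
small = monthly_cost(1)  # fine-tuned smaller model on 1 GPU
```

At these assumed rates, 8 GPUs land in the $20K - $30K/month range cited above, while a one-GPU deployment costs roughly a tenth of that, which is the economic case for fine-tuning smaller models.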

There are also frameworks that know how to partition your model across multiple GPUs.
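The partitioning idea can be sketched as pipeline parallelism: split the model's layers into contiguous stages, place one stage per GPU, and pass activations from stage to stage. This is a conceptual, single-process sketch with plain functions standing in for layers and no real devices involved:

```python
# Hypothetical sketch of model partitioning (pipeline parallelism):
# contiguous groups of layers become "stages", one per device.

def partition(layers, num_devices):
    # Assign contiguous, roughly equal groups of layers to each device.
    per_stage = -(-len(layers) // num_devices)  # ceiling division
    return [layers[i:i + per_stage] for i in range(0, len(layers), per_stage)]

def forward(stages, x):
    # Run activations through each stage in order; in a real system each
    # stage executes on its own GPU and activations cross device boundaries.
    for stage in stages:
        for layer in stage:
            x = layer(x)
    return x

# Four toy "layers" split across two "GPUs", two layers per stage.
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
stages = partition(layers, num_devices=2)
result = forward(stages, 5)
```

Real serving frameworks add batching and overlap the stages so all GPUs stay busy, but the placement decision is the same: each device holds only its slice of the model's weights.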

Currently, there is a lot of research around how to build distribution more efficiently both for the serving and the training aspects.

Interested in learning more?

Check out this 9-minute demo that covers MLOps best practices for generative AI applications.

View this webinar with QuantumBlack, AI by McKinsey, which covers the challenges of deploying and managing LLMs in live user-facing business applications.

Check out this demo and repo that demonstrates how to fine-tune an LLM and build an application.

Need help?

Contact our team of experts or ask a question in the community.

Have a question?

Submit your questions on machine learning and data science to get answers from our team of data scientists, ML engineers and IT leaders.