MLOps Live

Upcoming webinar: Implementing Gen AI in Highly Regulated Environments with Iguazio's CEO & CTO, June 25th 9am PST

LLM Validation and Evaluation

Alexandra Quinn | May 21, 2024

LLM evaluation is the process of assessing the performance and capabilities of LLMs. This helps determine how well the model understands and generates language, ensuring that it meets the specific needs of applications. There are multiple ways to perform LLM evaluation, each with different advantages. In this blog post, we explain the role of LLM evaluation in AI lifecycles and the different types of LLM evaluation methods. In the end, we show a demo of a chatbot that was developed with crowdsourcing.

This blog post is based on a webinar with Ehud Barnea, PhD, Head of AI at Tasq. AI, Yaron Haviv, co-founder and CTO of Iguazio (acquired by McKinsey) and Guy Lecker, ML Engineer Team Lead at Iguazio, which you can watch here.

Gen AI Reference Architecture Following Established ML Lifecycles

Building generative AI applications requires four main elements:

  • Data management - Ingesting, preparing and indexing the data.
  • Model development - Taking the existing baseline model, training and tuning if required with additional data gathered from data processing or through a feedback loop, evaluating, testing and deployment.
  • Application deployment - Receiving the prompt engineering requests, running through LLMs and post-processing.
  • LiveOps - Monitoring and validation of information from the application pipeline onto monitoring frameworks, labeling, filtering and sending to fine-tuning if required.
LLM Validation and Evaluation

To build accurate and risk-free models, improve performance and implement governance, it’s important to place data and model controls in gen AI apps. This requires controls in the pipelines, in multiple places. This includes:

  • Cleaning and preparing data during pre-processing, before it is sent to the model.
  • Testing the model with different prompts and verifying outputs and behavior during model application testing.
  • Checking incoming user requests and prompts in real-time.
  • Monitoring LLM usage by use case and user and logging activities.
  • Verifying compliance and quality of results in real-time before they are sent to the users.
  • Monitoring application logs for compliance, accuracy and quality of results.
LLM Validation and Evaluation

This is where LLM evaluation fits in.

Solving the LLM Evaluation Production Hurdle

According to Ehud Barnea, PhD, Head of AI at Tasq. AI, it seems like every time a new open source LLM model becomes publicly available, the data science community gets hyped and excited about its qualities. The potential of its accuracy, security, quality, and other features make it seem like a good option for corporate use cases.

However, when trying to use them, reality sets in and they tend to discover that performance isn’t as good as they expected. This is because standard benchmarks don’t reflect reality.

It's not that the tipping point hasn't been reached. Open-source models like Mistral AI and Claude can perform just as well as GPT. But with visibility still lacking, it’s difficult for companies to know if these models can be helpful for their own data and use cases. When deploying to production, results might be unexpected and even have significant business implications. Model evaluation can help determine the model's fit.

When is Evaluation Needed?

There are three places evaluation is needed:

  1. Deployment and CI/CD - Before the model reaches production.
  2. Production monitoring - After production to ensure accuracy.
  3. During development - Training iterations to improve results. This is more challenging with generative AI.

Evaluation Methods

There are two main types of LLM evaluation methods: algorithmic and human:

  • Algorithmic Methods -
    • “Classic” methods, like BLEU, ROUGE, BertScore, etc. These methods are based on similarity with reference answers. They are not suitable for LLMs since there is no single right answer for LLMs.
    • LLM Judges - Using LLMs to evaluate LLMs. This method is suitable for LLMs. However, it is less accurate and biased. For example, many models prefer responses from their own model.
  • Human Preference Methods
    • Internal Employees- Using employees, mainly R&D, to evaluate generated responses. However, this method is costly and unscalable.
    • Hiring Annotators - Hiring a team to provide feedback. However, this is not scalable and requires managerial efforts.  
    • User Feedback (Data Flywheel) - When users provide feedback about the response. However, this is biased and limited, since users do not want to share feedback all the time. It is also only relevant for models in production.
    • Global Crowd - Asking people from around the world. This is scalable and unbiased, since it leverages people from different countries, cultures and languages. However, it is much harder to set up.

Today, many companies use a combination of these evaluation methods.

Evaluating LLMs with a Global Crowd: Mistral vs. ChatGPT

Let's look at an example of a global crowd evaluation in action: Mistral and ChatGPT were prompted to rewrite an angry X Tweet in a positive mindset while telling the same story.

  • The original Tweet: I feel like my go to emotion is angry.
  • ChatGPT: I believe that my default emotion is passion.
  • Mistral: Despite feeling angry at times, I believe there’s room for growth and improvement in managing my emotions.

Both models were able to generate content with a positive mindset, but only Mistral kept the same story.

To evaluate the qualities of these outputs, human guidance was used. A global crowd was defined from English-speaking countries. They were asked which sentence is more positive. Mistral was found by the crowd to be 87% more positive and also wrote nicer texts.

However, when Mistral was limited to writing the same length of text as ChatGPT, it was only 58% more positive, and ChatGPT was found to be 57% more closer in meaning.

Demo: Developing a Chatbot with LLM Evaluation Crowdsourcing

In this demo, we are going to develop a chatbot for a fashion company and show how to use crowdsourcing for evaluation, using MLRun.

The chatbot should:

  • Recommend website items based on the customer’s preferences.
  • Offer fashion tips according to the website styling.
  • Answer questions about website policies, like shipping and returns.
  • Be polite and follow the company’s code of conduct.

The production serving pipeline steps:

  1. Prompt
  2. Session loading
  3. Query refinement
  4. Subject classification
  5. Toxicity filter
  6. Application logic
  7. History saving
  8. Response

In addition, subject classification and toxicity filtering can also go directly to history saving.

Model servers (LLM, CNN, etc.) collect inputs from query refinement, subject classification and toxicity filtering to inference the LLM and fine-tuning.

It looks like this:

LLM Validation and Evaluation

For this demo, we chose the LLM Mistral-7b-Instruct-v0.2. This is because it has a good context window, cost to serve is fairly low and response quality is high.

LLM Validation and Evaluation

To improve the LLM, we used RLHF (Reinforcement Learning from Human Feedback). Using MLRun’s model server, we collected the input and output prompts for various nodes within the application graph.

Then, we wanted to achieve the following:

  • Reduce prompt size - Using fashion tips RAG increased each clothing matching prompt. So we removed the fashion tips and inferred Mistral with its “default taste” in fashion so the model adjusts itself to the crowd’s taste.
  • Insert guardrails - Adding another layer of defense to the model. So we added ~30 samples of input prompts and the expected outputs. Then we inferred them with Mistral.
  • Improve Responses - Responses were not always fitting and friendly. So we prompt engineered guidelines and a code of conduct and inferred the model again.

We then used crowdsourcing to determine which output, the original or the newly inferred one, was more engaging and fun, helpful and useful, and offensive, rude, or toxic. The option with the better score was marked as “chosen”.

The model was then fine-tuned with DPO. Here are some of the metrics as can be seen in MLRun:

LLM Validation and Evaluation

Here are some sample results.

Example 1:

LLM Validation and Evaluation

Example 2:

LLM Validation and Evaluation
LLM Validation and Evaluation

Overall, 71% of crowdsourced uses preferred the new, fine-tuned model.


LLM evaluation enhances model reliability and efficiency while also ensuring it meets ethical standards and is free of biases. There are multiple ways to perform LLM evaluation, and they can be automated and streamlined through a gen AI architecture. This helps overcome the LLM production hurdle and operationalize generative AI applications so they bring business value.

To see the entire webinar and watch a demo of the chatbot in action, watch the full webinar recording here.

For personalized consultation services regarding AI implementation, let’s talk.