Overview
The key to building production-ready LLM applications is a tight feedback loop between prompt engineering and evaluation. Whether you are optimizing a chatbot, working on Retrieval-Augmented Generation (RAG), or refining a text generation task, evaluation is a critical step to ensure consistent performance across different inputs, models, and parameters. In this section, we explain how to use Agenta to quickly evaluate and compare the performance of your LLM applications.
Set up evaluation
📄️ Configure Evaluators
Configure evaluators for your use case
📄️ Create Test Sets
Create test sets to use as ground truth in your evaluations
Run evaluations
📄️ Run Evaluations from the web UI
Run and analyze evaluations from the Agenta web UI
📄️ Run Evaluations with the SDK
Run evaluations programmatically using the Agenta SDK
Available evaluators
| Evaluator Name | Use Case | Type | Description |
| --- | --- | --- | --- |
| Exact Match | Classification / Entity Extraction | Pattern Matching | Checks whether the output exactly matches the expected result. |
| Contains JSON | Classification / Entity Extraction | Pattern Matching | Checks whether the output contains valid JSON. |
| Regex Test | Classification / Entity Extraction | Pattern Matching | Checks whether the output matches a given regex pattern. |
| JSON Field Match | Classification / Entity Extraction | Pattern Matching | Compares specific fields within JSON data. |
| JSON Diff Match | Classification / Entity Extraction | Similarity Metrics | Compares the generated JSON with a ground-truth JSON based on schema or values. |
| Similarity Match | Text Generation / Chatbot | Similarity Metrics | Compares the generated output with the expected result using Jaccard similarity (illustrated in the first sketch below). |
| Semantic Similarity Match | Text Generation / Chatbot | Semantic Analysis | Compares the meaning of the generated output with the expected result. |
| Starts With | Text Generation / Chatbot | Pattern Matching | Checks whether the output starts with a specified prefix. |
| Ends With | Text Generation / Chatbot | Pattern Matching | Checks whether the output ends with a specified suffix. |
| Contains | Text Generation / Chatbot | Pattern Matching | Checks whether the output contains a specific substring. |
| Contains Any | Text Generation / Chatbot | Pattern Matching | Checks whether the output contains any of a list of substrings. |
| Contains All | Text Generation / Chatbot | Pattern Matching | Checks whether the output contains all of a list of substrings. |
| Levenshtein Distance | Text Generation / Chatbot | Similarity Metrics | Calculates the Levenshtein distance between the output and the expected result. |
| LLM-as-a-judge | Text Generation / Chatbot | LLM-based | Sends the output to an LLM for critique and evaluation. |
| RAG Faithfulness | RAG / Text Generation / Chatbot | LLM-based | Evaluates whether the output is faithful to the retrieved documents in RAG workflows. |
| RAG Context Relevancy | RAG / Text Generation / Chatbot | LLM-based | Measures the relevance of the retrieved documents to the given question in RAG workflows. |
| Custom Code Evaluation | Custom Logic | Custom | Lets you define your own evaluator in Python (see the second sketch below). |
| Webhook Evaluator | Custom Logic | Custom | Sends the output to a webhook for external evaluation. |
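To make the similarity metrics concrete, here is a minimal, self-contained Python sketch of the two string metrics named above (Jaccard similarity and Levenshtein distance). It illustrates the underlying math only; it is not Agenta's internal implementation, and the actual evaluators may tokenize, normalize, or threshold differently.

```python
def jaccard_similarity(output: str, expected: str) -> float:
    """Ratio of shared tokens to total unique tokens (1.0 = identical token sets)."""
    a, b = set(output.lower().split()), set(expected.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def levenshtein_distance(s: str, t: str) -> int:
    """Minimum number of single-character edits (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (cs != ct),  # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]


print(jaccard_similarity("the cat sat", "the cat slept"))  # 0.5
print(levenshtein_distance("kitten", "sitting"))           # 3
```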
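For Custom Code Evaluation, you supply a Python function that receives a test case and the application's output and returns a numeric score. The sketch below is a hypothetical example: the entry-point name `evaluate` and the parameter names (`app_params`, `inputs`, `output`, `correct_answer`) are assumptions here, so check the Configure Evaluators guide for the exact signature your Agenta version expects.

```python
from typing import Dict


def evaluate(
    app_params: Dict[str, str],   # app configuration, e.g. the prompt (assumed name)
    inputs: Dict[str, str],       # the test-case inputs (assumed name)
    output: str,                  # the LLM application's output
    correct_answer: str,          # the ground truth from the test set
) -> float:
    # Score 1.0 for a case-insensitive exact match; otherwise fall back to
    # token overlap so near-misses earn partial credit instead of 0.
    if output.strip().lower() == correct_answer.strip().lower():
        return 1.0
    out_tokens = set(output.lower().split())
    ref_tokens = set(correct_answer.lower().split())
    if not ref_tokens:
        return 0.0
    return len(out_tokens & ref_tokens) / len(out_tokens | ref_tokens)
```

The design intent is graded scoring: exact matches score 1.0, while partially correct outputs score their token overlap, which makes evaluation results easier to compare across variants than a binary pass/fail.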