DeepEval
Unit test your LLM applications the way you test Python code
Description
DeepEval is an open-source Python evaluation framework built by Confident AI that brings Pytest-style test ergonomics to LLM application development. Where Ragas targets RAG pipelines specifically, DeepEval covers the full stack: RAG pipelines, AI agents, chatbots, and multi-modal applications. Backend engineers already familiar with Pytest can write LLM test suites using the same @pytest.mark.parametrize patterns and run them via deepeval test run without adopting a new conceptual framework.
Key Capabilities:
- Pytest-native test runner: Evaluation suites run through a CLI (deepeval test run) using standard Pytest decorators, with parallel execution support via the -n flag
- Comprehensive metric library: Covers RAG metrics (Faithfulness, Contextual Precision, Contextual Recall), agent metrics (tool correctness, task efficiency, plan quality), and multi-turn conversational metrics (Knowledge Retention, Conversation Completeness)
- G-Eval and DAG metrics: G-Eval scores outputs against any custom criteria using LLM-as-judge with chain-of-thought reasoning; DAG provides deterministic decision-tree-based scoring for use cases requiring stricter reproducibility
- LLM benchmark runner: Executes canonical benchmarks including MMLU, HellaSwag, HumanEval, and GSM8K against any model in under 10 lines of Python
- Synthetic dataset generation: Produces single-turn and multi-turn test case sets from a corpus, including goldens for conversational agent evaluation
- CI/CD integration: Blocks deploys when evaluation scores fall below defined thresholds, with support for any CI/CD environment and direct integrations with LangChain, LangGraph, CrewAI, and OpenAI Agents
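Two of the ideas above, decision-tree scoring and threshold-based deploy gating, can be sketched in plain Python. This is a minimal illustration that uses only the standard library and does not call DeepEval's API. The names Leaf, Decision, score_output, RUBRIC, and gate, along with the "refund" rubric and the 0.7 threshold, are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Union


@dataclass
class Leaf:
    """Terminal node: a fixed score, so identical outputs always score the same."""
    score: float


@dataclass
class Decision:
    """Internal node: a yes/no check that routes to one of two children."""
    check: Callable[[str], bool]
    if_true: "Node"
    if_false: "Node"


Node = Union[Leaf, Decision]


def score_output(node: Node, output: str) -> float:
    """Walk the tree from the root until a leaf is reached, then return its score."""
    while isinstance(node, Decision):
        node = node.if_true if node.check(output) else node.if_false
    return node.score


# Hypothetical rubric: the reply must mention "refund"; if it does, a reply of
# at most 12 words scores higher than a longer one.
RUBRIC = Decision(
    check=lambda text: "refund" in text.lower(),
    if_true=Decision(
        check=lambda text: len(text.split()) <= 12,
        if_true=Leaf(1.0),
        if_false=Leaf(0.6),
    ),
    if_false=Leaf(0.0),
)


def gate(scores: Mapping[str, float], thresholds: Mapping[str, float]) -> int:
    """Return a process exit code: 0 if every metric meets its minimum, else 1.

    A metric that has a threshold but no recorded score counts as a failure.
    """
    failed = False
    for name, minimum in sorted(thresholds.items()):
        score = scores.get(name)
        if score is None or score < minimum:
            print(f"FAIL {name}: score={score} minimum={minimum}")
            failed = True
    return 1 if failed else 0


if __name__ == "__main__":
    reply = "Refund issued within five days."
    scores = {"summary_quality": score_output(RUBRIC, reply)}
    # A non-zero exit code is what makes a CI job stop the pipeline.
    raise SystemExit(gate(scores, {"summary_quality": 0.7}))
```

Because every branch is an ordinary predicate and every leaf a constant, the same output always yields the same score, which is the reproducibility property that motivates decision-tree scoring. In a real pipeline the predicates could themselves be LLM judgments, and the gate would read scores produced by the evaluation run instead of a hard-coded reply.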
Alternative tools
- Claude Code
Agentic coding tool that runs in your terminal
- OpenAI Codex CLI
Terminal coding agent built on OpenAI reasoning models
- Aider
AI pair programming in your terminal
- Cline
Open-source AI coding agent for any editor
- Braintrust Evals
Trace every step your LLM agent takes, from prompt to response
- Giskard
Scan AI agents for vulnerabilities before and after deployment
