DeepEval

Unit test your LLM applications the way you test Python code

Testing · Open Source

Description

DeepEval is an open-source Python evaluation framework built by Confident AI that brings Pytest-style test ergonomics to LLM application development. Where Ragas targets RAG pipelines specifically, DeepEval covers the full stack: RAG pipelines, AI agents, chatbots, and multi-modal applications. Backend engineers already familiar with Pytest can write LLM test suites using the same @pytest.mark.parametrize patterns and run them via deepeval test run without adopting a new conceptual framework.

Key Capabilities:

✓ Pytest-native test runner: Evaluation suites run through a CLI (deepeval test run) using standard Pytest decorators, with parallel execution support via the -n flag

✓ Comprehensive metric library: Covers RAG metrics (Faithfulness, Contextual Precision, Contextual Recall), agent metrics (tool correctness, task efficiency, plan quality), and multi-turn conversational metrics (Knowledge Retention, Conversation Completeness)

✓ G-Eval and DAG metrics: G-Eval scores outputs against any custom criteria using LLM-as-judge with chain-of-thought reasoning; DAG provides deterministic decision-tree-based scoring for use cases requiring stricter reproducibility

✓ LLM benchmark runner: Executes canonical benchmarks including MMLU, HellaSwag, HumanEval, and GSM8K against any model in under 10 lines of Python

✓ Synthetic dataset generation: Produces single-turn and multi-turn test case sets from a corpus, including goldens for conversational agent evaluation

✓ CI/CD integration: Blocks deploys when evaluation scores fall below defined thresholds, with support for any CI/CD environment and direct integrations with LangChain, LangGraph, CrewAI, and OpenAI Agents
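
The threshold gating described in the last item can be sketched in plain Python. This is a minimal illustration of the idea only, using the standard library and no DeepEval calls: the AgentRun structure, the tool_correctness function (a simplified stand-in for a tool-correctness style metric, not DeepEval's implementation), the gate function, and the sample data are all hypothetical names introduced for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class AgentRun:
    """One recorded agent interaction (hypothetical structure for this sketch)."""
    name: str
    expected_tools: list[str]
    called_tools: list[str]


def tool_correctness(run: AgentRun) -> float:
    """Fraction of expected tools that the agent actually called.

    Simplified stand-in for a tool-correctness metric; returns 1.0 when
    no tools were expected.
    """
    if not run.expected_tools:
        return 1.0
    called = set(run.called_tools)
    hits = sum(1 for tool in run.expected_tools if tool in called)
    return hits / len(run.expected_tools)


def gate(runs: list[AgentRun], threshold: float) -> int:
    """Return an exit code: 0 if every run meets the threshold, else 1."""
    failed = False
    for run in runs:
        score = tool_correctness(run)
        if score < threshold:
            failed = True
            print(f"FAIL {run.name}: score {score:.2f} < threshold {threshold:.2f}")
    return 1 if failed else 0


# Hypothetical sample data: the second run skips one expected tool.
RUNS = [
    AgentRun("refund_request", ["lookup_order", "issue_refund"],
             ["lookup_order", "issue_refund"]),
    AgentRun("address_change", ["lookup_order", "update_address"],
             ["lookup_order"]),
]
```

In a CI job, a wrapper script could pass the result of gate(RUNS, threshold=0.75) to sys.exit, so that a non-zero exit code fails the pipeline step and blocks the deploy. With DeepEval itself, the scoring and thresholds would come from its metric classes rather than a hand-written function like the one above.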

Alternative tools

Claude Code: Agentic coding tool that runs in your terminal
OpenAI Codex CLI: Terminal coding agent built on OpenAI reasoning models
Aider: AI pair programming in your terminal
Cline: Open-source AI coding agent for any editor
Braintrust Evals: Trace every step your LLM agent takes, from prompt to response
Giskard: Scan AI agents for vulnerabilities before and after deployment
