DevExplore
    © 2026 DevExplore. All rights reserved.


    Added 5/22/2026

    DeepEval

    Unit test your LLM applications the way you test Python code

    Testing, Open Source

    Description

    DeepEval is an open-source Python evaluation framework built by Confident AI that brings Pytest-style test ergonomics to LLM application development. Where Ragas targets RAG pipelines specifically, DeepEval covers the full stack: RAG pipelines, AI agents, chatbots, and multi-modal applications. Backend engineers already familiar with Pytest can write LLM test suites using the same @pytest.mark.parametrize patterns and run them via deepeval test run without adopting a new conceptual framework.
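As a rough illustration of the test-case-plus-threshold pattern described above, the sketch below uses only the Python standard library. It is not DeepEval's API: `Case`, `keyword_coverage`, `check_case`, and `run_cases` are hypothetical names, and the keyword-overlap score is a toy stand-in for a real metric such as an LLM-as-judge score.

```python
from dataclasses import dataclass


@dataclass
class Case:
    """One evaluation case: a prompt, the model's answer, and expected keywords."""
    input: str
    actual_output: str
    expected_keywords: list[str]


def keyword_coverage(case: Case) -> float:
    """Toy metric (assumption): fraction of expected keywords found in the output."""
    if not case.expected_keywords:
        raise ValueError("expected_keywords must not be empty")
    text = case.actual_output.lower()
    hits = sum(1 for kw in case.expected_keywords if kw.lower() in text)
    return hits / len(case.expected_keywords)


def check_case(case: Case, threshold: float = 0.5) -> float:
    """Fail like a unit test when the score falls below the threshold."""
    score = keyword_coverage(case)
    if score < threshold:
        raise AssertionError(
            f"score {score:.2f} below threshold {threshold:.2f} for {case.input!r}"
        )
    return score


# Hypothetical cases; in practice the outputs would come from the application under test.
CASES = [
    Case("What is the capital of France?", "Paris is the capital of France.", ["Paris"]),
    Case("Name two primary colors.", "Red and blue are primary colors.", ["red", "blue"]),
]


def run_cases(cases: list[Case]) -> list[float]:
    """Check every case in order and return the scores."""
    return [check_case(case) for case in cases]
```

Under Pytest, the loop in `run_cases` would typically be replaced by `@pytest.mark.parametrize("case", CASES)` on a test function that calls `check_case`, so each case is reported as its own test.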

    Key Capabilities:

    ✓ Pytest-native test runner: Evaluation suites run through a CLI (deepeval test run) using standard Pytest decorators, with parallel execution support via the -n flag

    ✓ Comprehensive metric library: Covers RAG metrics (Faithfulness, Contextual Precision, Contextual Recall), agent metrics (tool correctness, task efficiency, plan quality), and multi-turn conversational metrics (Knowledge Retention, Conversation Completeness)

    ✓ G-Eval and DAG metrics: G-Eval scores outputs against any custom criteria using LLM-as-judge with chain-of-thought reasoning; DAG provides deterministic decision-tree-based scoring for use cases requiring stricter reproducibility

    ✓ LLM benchmark runner: Executes canonical benchmarks including MMLU, HellaSwag, HumanEval, and GSM8K against any model in under 10 lines of Python

    ✓ Synthetic dataset generation: Produces single-turn and multi-turn test case sets from a corpus, including goldens for conversational agent evaluation

    ✓ CI/CD integration: Blocks deploys when evaluation scores fall below defined thresholds, with support for any CI/CD environment and direct integrations with LangChain, LangGraph, CrewAI, and OpenAI Agents
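The threshold-based deploy gate mentioned in the last capability can be sketched in a few lines. This is a minimal, library-free illustration and not DeepEval's own implementation: the metric names, the threshold values, and the `failing_metrics` and `gate` helpers are all assumptions made for the example.

```python
import sys

# Hypothetical minimum scores per metric; real values depend on the project.
THRESHOLDS = {"faithfulness": 0.8, "contextual_recall": 0.7}


def failing_metrics(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return the sorted names of metrics that are missing or below their minimum."""
    return sorted(
        name
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    )


def gate(scores: dict[str, float], thresholds: dict[str, float] = THRESHOLDS) -> int:
    """Return a process exit code: 0 lets the deploy proceed, 1 blocks it."""
    failures = failing_metrics(scores, thresholds)
    for name in failures:
        print(
            f"FAIL {name}: score={scores.get(name)!r} minimum={thresholds[name]}",
            file=sys.stderr,
        )
    return 1 if failures else 0


# In a CI step, the exit code would be passed on so the pipeline stops on failure:
#     sys.exit(gate(scores_from_evaluation_run))
```

Most CI systems treat a non-zero exit code as a failed step, which is why the gate returns an integer instead of raising an exception.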

    Alternative tools

    • Claude Code

      Agentic coding tool that runs in your terminal

    • Patronus AI

      Score, benchmark, and stress-test LLM outputs for enterprise deployments

    • Harness

      AI-powered software delivery platform for the post-code lifecycle.

    • Spacelift

      IaC orchestration platform for Terraform, OpenTofu, and Pulumi teams.

    • Kiro

      AWS spec-driven AI IDE with GovCloud certification

    • CodeRabbit

      AI code review platform for pull requests and agent output
