DeepChecks

Validate ML models, LLM applications, and AI agent decisions across every development stage

DeepChecks is profiled here as an Evaluation tool for engineering teams. Read about features, pricing, and how it compares to related options in the tools directory.

Evaluation · Open Source

Description

Deepchecks is an open-source ML and LLM testing platform founded in 2019 in Tel Aviv by Philip Tannor and Shir Chorev. Both are graduates of the IDF's Talpiot program and the Unit 8200 intelligence unit, and they had worked together since they were 18. The pair published an arXiv paper on ML testing methodology in March 2022 and raised a $14M seed round in June 2023, a research-first sequence that distinguishes Deepchecks from most commercially led testing tools. Check Point Software acquired Deepchecks in May 2026, integrating it into Check Point's Agentic Network Security Orchestration platform. The open-source library remains available under its original license, though the commercial platform's roadmap is now directed by Check Point's enterprise security priorities.

Key Capabilities

ML model validation across the development lifecycle: Systematic data validation, feature drift detection, model performance testing, and segmentation error analysis run at training, staging, and production stages, drawing directly on software CI/CD testing principles applied to ML
LLM evaluation with version comparison: Auto-scoring, business metric tracking, and side-by-side version comparison for LLM applications and RAG pipelines, covering answer quality, instruction following, and output faithfulness
Granular agent sub-task evaluation: Breaks complex agent executions into individual sub-tasks and scores each one using LLM judges, assessing tool selection, error recovery, and decision quality at both step and session level
Root cause analysis: Identifies the specific code-level origin of model failures rather than returning aggregate scores, reducing initial diagnosis time by up to 70% according to Deepchecks' own benchmarks
Flexible deployment including air-gapped environments: Runs as SaaS, virtual private cloud on GCP or Azure, bare-metal, or air-gapped, with native AWS integrations covering SageMaker Partner AI Apps, Bedrock, and GovCloud for regulated industries
Enterprise compliance and workflow integrations: SOC 2 Type 2, GDPR, and HIPAA compliance alongside Slack and PagerDuty integrations for routing validation alerts into existing operations workflows
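
To make the feature drift detection capability above concrete, here is a minimal, library-independent sketch of one common drift measure, the Population Stability Index (PSI). It is not Deepchecks' implementation or API. The function names `psi` and `check_drift`, the bin count, and the 0.2 threshold (a widely used rule of thumb) are all assumptions chosen for illustration.

```python
import math
from typing import Dict, List, Sequence, Union


def psi(reference: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between two samples of one numeric feature.

    Illustrative only: bin edges are derived from the reference sample's range.
    """
    lo, hi = min(reference), max(reference)
    # Fall back to a unit width when the reference sample is constant.
    width = (hi - lo) / bins or 1.0

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range values into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        eps = 1e-6
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


def check_drift(
    reference: Sequence[float],
    current: Sequence[float],
    threshold: float = 0.2,  # assumed rule-of-thumb cutoff, not a Deepchecks default
) -> Dict[str, Union[float, bool]]:
    """Return the PSI score and whether it exceeds the threshold."""
    score = psi(reference, current)
    return {"psi": score, "drifted": score > threshold}


if __name__ == "__main__":
    training = [i / 100 for i in range(1000)]   # feature values seen at training time
    production = [v + 5.0 for v in training]    # same feature, shifted in production
    print(check_drift(training, training))
    print(check_drift(training, production))
```

A check like this would typically run once per feature at each stage (training, staging, production), with the training sample as the reference. Production tools generally add categorical handling, multiple drift measures, and alert routing on top of this basic comparison.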

Alternative tools

Gentrace: Testing and evaluation for generative AI applications
HELM: Reproducible, multi-scenario benchmarking of foundation models
lm-evaluation-harness: Standard framework for benchmarking language models
garak: Vulnerability scanner for large language models
Evidently AI: Evaluate, test, and monitor traditional ML models and LLM applications from one framework
Vectara HHEM: Detect hallucinations in RAG outputs using a dedicated classification model
