Patronus AI
Score, benchmark, and stress-test LLM outputs for enterprise deployments
Description
Patronus AI is a closed-source LLM evaluation platform founded in September 2023 by Anand Kannappan and Rebecca Qian, who previously led explainable ML research at Meta Reality Labs and responsible NLP research at Meta AI (FAIR), respectively. That research background informs the platform's architecture: rather than wrapping a general-purpose LLM to judge outputs, Patronus trains dedicated evaluation models for specific failure modes. The flagship model, Lynx, is a 70B-parameter hallucination detection model released as open weights in July 2024; Patronus's published benchmark results show it outperforming GPT-4 at identifying factual mistakes in LLM outputs.
Key Capabilities
Lynx hallucination detection model: A 70B open-weight model fine-tuned for identifying hallucinations, factual errors, and refusals, available independently of the Patronus platform for teams that need a standalone hallucination scorer
GLIDER general judge: An open-weight general-purpose LLM evaluation model that scores outputs across quality dimensions beyond hallucination, including style, tone, and brand alignment
Adversarial test suite generation: Automatically generates stress-test cases targeting 50+ failure mode categories, including PII disclosure, copyright infringement, safety violations, and domain-specific accuracy gaps
Percival agent debugger: Traces multi-step agent executions and detects 20+ agentic failure modes including planning errors, tool misuse, and goal misalignment across the full agent run
Generative Simulators: Adaptive testing environments that dynamically generate agent scenarios at scale rather than running agents against static evaluation datasets
FinanceBench domain benchmark: A financial domain evaluation benchmark co-developed with 15 financial industry experts, which Patronus used to show that leading LLMs hallucinated on up to 81% of financial analyst questions
Alternative tools
- Claude Code
Agentic coding tool that runs in your terminal
- Harness
AI-powered software delivery platform for the post-code lifecycle
- Spacelift
IaC orchestration platform for Terraform, OpenTofu, and Pulumi teams
- Kiro
AWS spec-driven AI IDE with GovCloud certification
- CodeRabbit
AI code review platform for pull requests and agent output
- Qodo
AI code review platform built around code integrity
