Arize Phoenix

Trace every step your LLM agent takes, from prompt to response

Description

Arize Phoenix is an open-source AI observability and evaluation platform built by Arize AI, a machine learning observability vendor founded in 2020. Phoenix launched in May 2023 at Arize's Observe summit, bringing the spans-and-traces model familiar from traditional APM tools to LLM application development. Unlike most open-source observability tools, which reserve advanced features for paid tiers, Phoenix ships fully featured with no feature gates. The commercial counterpart, Arize AX, serves enterprise teams that need RBAC, SSO, audit trails, and higher trace volumes, but the open-source library itself is not artificially limited.

Key Capabilities

OpenTelemetry-native tracing: Phoenix instruments LLM calls, retrieval steps, tool executions, and agent reasoning through OpenInference, an open OTel-based telemetry standard that Arize maintains, keeping trace data portable across observability platforms
Broad framework and provider support: Auto-instrumentation covers the OpenAI Agents SDK, Claude Agent SDK, LangGraph, LlamaIndex, CrewAI, DSPy, Vercel AI SDK, and Mastra, alongside providers including Anthropic, AWS Bedrock, Google GenAI, and LiteLLM
LLM evals library: Pre-built evaluation templates for hallucination, summarization, and retrieval relevance run against any traced span, with support for custom LLM-as-judge templates and human annotation queues
Datasets and experiments: Traces group into datasets that run through different application versions side-by-side, producing comparison results that confirm whether a prompt or architecture change produced a measurable improvement
RAG embedding analysis: Clusters query and knowledge base embeddings to surface missing context, irrelevant retrievals, and semantically similar failure cases without manual log inspection
Span replay and prompt playground: Any production span replays with modified inputs for targeted debugging, and the prompt playground runs side-by-side model comparisons without leaving the Phoenix interface
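
To make the spans-and-traces model behind these capabilities concrete, here is a minimal standard-library sketch of how one agent run breaks down into nested spans. It is not Phoenix's API or the OpenInference schema: ToyTracer, Span, answer, and render are hypothetical names invented for illustration, and the retrieval and model calls are hard-coded stand-ins.

```python
import itertools
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Span:
    """One timed step in a trace (hypothetical structure, not Phoenix's schema)."""
    span_id: int
    parent_id: Optional[int]
    name: str
    kind: str  # e.g. "AGENT", "RETRIEVER", "LLM"
    attributes: dict = field(default_factory=dict)
    start: float = 0.0
    end: float = 0.0


class ToyTracer:
    """Collects spans in memory and tracks the current parent with a stack."""

    def __init__(self) -> None:
        self.spans: list[Span] = []
        self._stack: list[Span] = []
        self._ids = itertools.count(1)

    @contextmanager
    def span(self, name: str, kind: str, **attributes) -> Iterator[Span]:
        # The innermost open span, if any, becomes the parent of the new one.
        parent_id = self._stack[-1].span_id if self._stack else None
        current = Span(next(self._ids), parent_id, name, kind, dict(attributes))
        current.start = time.perf_counter()
        self.spans.append(current)
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end = time.perf_counter()
            self._stack.pop()


def answer(tracer: ToyTracer, question: str) -> str:
    """A stubbed retrieve-then-generate run, recorded as three spans."""
    with tracer.span("agent.run", "AGENT", input=question) as root:
        with tracer.span("retrieve", "RETRIEVER", query=question) as retrieval:
            docs = ["Phoenix launched in May 2023."]  # stand-in for a vector search
            retrieval.attributes["documents"] = docs
        with tracer.span("llm.call", "LLM", prompt=f"{question}\n{docs[0]}") as llm:
            reply = "May 2023"  # stand-in for a model response
            llm.attributes["output"] = reply
        root.attributes["output"] = reply
    return reply


def render(spans: list[Span]) -> list[str]:
    """Indent each span under its parent; spans are stored in start order."""
    depth: dict[int, int] = {}
    lines = []
    for s in spans:
        depth[s.span_id] = 0 if s.parent_id is None else depth[s.parent_id] + 1
        lines.append("  " * depth[s.span_id] + f"{s.kind} {s.name}")
    return lines


tracer = ToyTracer()
answer(tracer, "When did Phoenix launch?")
print("\n".join(render(tracer.spans)))
```

Each span records its parent, its inputs and outputs, and its timing, which is the information a trace viewer needs to show where a run spent time or where a retrieval returned the wrong context. A real setup would rely on OpenTelemetry and an instrumentation library rather than a hand-rolled tracer like this one.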


Alternative tools

Claude Code: Agentic coding tool that runs in your terminal
OpenAI Codex CLI: Terminal coding agent built on OpenAI reasoning models
Aider: AI pair programming in your terminal
Cline: Open-source AI coding agent for any editor
Giskard: Scan AI agents for vulnerabilities before and after deployment
Promptfoo: Test and red team LLM applications from the command line
