MLflow

Track experiments, manage models, and evaluate LLM applications across the full ML lifecycle

MLflow is profiled here as an Observability tool for engineering teams. Read about its features, pricing, and how it compares to related options in the tools directory.

Observability · Open Source


Description

MLflow is an open-source platform, licensed under Apache 2.0 and built by Databricks. It was first released in June 2018 by Matei Zaharia, who also created Apache Spark and co-founded Databricks with six colleagues from UC Berkeley. Zaharia built MLflow after observing the same pattern across hundreds of Databricks enterprise customers: data science teams tracked experiments in spreadsheets and notebooks, then couldn't reconstruct the exact conditions that produced a promising model. MLflow 3.0, released in June 2025, extended that reproducibility philosophy to GenAI by adding LLM tracing, quality evaluation, prompt versioning, and feedback collection, without requiring a separate observability platform. With 30 million monthly downloads, 850+ contributors, and adoption across 5,000+ organizations, MLflow sits at a different scale from every other tool in the Observability category.

Key Capabilities

Experiment tracking: Logs hyperparameters, metrics, artifacts, and source code for every training run in a centralized repository, making it possible to reproduce any prior experiment exactly and compare runs across parameters and dataset versions
Model registry with versioning: A production model registry handles staging transitions, access controls, and webhooks for automated deployment events, and packages models from scikit-learn, TensorFlow, PyTorch, XGBoost, Hugging Face, and Spark MLlib in a unified format
LLM tracing and agent observability (MLflow 3.0): Records inputs, outputs, and metadata for every intermediate step in an LLM call chain or agent workflow, providing the same granular trace visibility for GenAI applications that experiment tracking provides for traditional ML
Quality evaluation with LLM judges (MLflow 3.0): Built-in and custom scorers run LLM-as-judge evaluation against production traces, with a revamped UI for reviewing scores and a feedback collection API for incorporating human expert ratings
Prompt versioning and AI Gateway (MLflow 3.0): Version-controls LLM prompts and application configurations alongside model artifacts, and provides a unified gateway layer for managing LLM provider access with cost controls and rate limiting
Multi-language SDKs and framework-agnostic integration: Python, TypeScript, JavaScript, Java, and R SDKs connect to any LLM provider, agent framework, or ML library, with a managed offering on Databricks that adds Unity Catalog governance and fully hosted infrastructure for enterprise teams

Alternative tools

HoneyHive: Evaluation and observability platform for AI agents
Sentry: Error tracking and performance monitoring for developers
SigNoz: Open-source, OpenTelemetry-native observability platform
Datadog: Unified observability for metrics, traces, and logs
Arize AX: Enterprise platform for AI observability and evaluation
OpenTelemetry: Vendor-neutral standard for traces, metrics, and logs
