Arize Phoenix
Open-source LLM observability and evaluation with native OpenTelemetry tracing and self-hosting
Arize Phoenix is an open-source LLM observability and evaluation platform from Arize AI. It uses OpenTelemetry as its tracing standard, auto-instruments major LLM frameworks, and runs dataset-based evaluations with LLM-as-judge or custom evaluators. It is available as a self-hosted open-source deployment or through Arize's managed cloud service, and it is a strong fit for teams already running OpenTelemetry infrastructure.
OpenTelemetry has become the standard for distributed tracing across the software industry. Your Kubernetes services emit OTel spans. Your databases emit OTel spans. Your message queues emit OTel spans. And then you add an LLM agent, and suddenly you're in a completely separate observability world with framework-specific callbacks and vendor-specific SDKs that don't connect to anything you already run.
Arize Phoenix is built on the premise that this shouldn't be true. LLM traces should be OpenTelemetry traces. The agent span that calls a tool should connect to the service span that served the response. The retrieval step should be observable the same way a database query is observable.
What Arize Phoenix is
Arize Phoenix is an open-source LLM observability and evaluation platform. It's built by Arize AI, a company that has been building ML observability tools since 2020. Phoenix is the open-source product; Arize's broader platform adds enterprise features like production ML monitoring, model explainability, and larger-scale data infrastructure.
The repository is at Arize-ai/phoenix, licensed under Apache-2.0, with approximately 6,800 stars as of late April 2026. Active development is backed by Arize AI's commercial interests, which is a meaningful difference from purely community-maintained open-source projects.
Phoenix is available as a self-hosted deployment and through Arize's managed cloud service. The self-hosted option is fully featured at no cost.
OpenTelemetry as a first principle
The standard OTel tracing model describes a trace as a tree of spans. Each span has a name, start time, duration, attributes (key-value metadata), and optional child spans. This model maps well to LLM workflows: one trace per user request, with spans for retrieval, reranking, LLM generation, and tool calls.
The challenge is that standard OTel attributes don't cover LLM-specific concepts like prompt templates, token counts, or embedding vectors. Arize created OpenInference, a semantic convention layer that extends OTel with LLM-specific attribute names. When a LangChain call happens inside a Phoenix-traced workflow, the span includes:
- llm.model_name with the model identifier
- llm.token_count.prompt, llm.token_count.completion, and llm.token_count.total
- llm.input_messages and llm.output_messages with the full conversation
- retrieval.documents with retrieved chunks and their scores
- embedding.embeddings with the vector data if applicable
These conventions mean Phoenix can parse the LLM-specific data and display it meaningfully, rather than showing raw key-value spans that you have to interpret manually.
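To make the attribute naming concrete, here is a minimal sketch that builds a flat attribute dictionary in the OpenInference style. The function name and its parameters are hypothetical, and the indexed message keys (for example llm.input_messages.0.message.role) reflect the flattening pattern OpenInference uses because OTel attribute values cannot be nested objects. Treat it as an illustration of the shape of the data, not as the instrumentation code itself.

```python
from typing import Any


def flatten_llm_attributes(
    model_name: str,
    input_messages: list[dict[str, str]],
    output_messages: list[dict[str, str]],
    prompt_tokens: int,
    completion_tokens: int,
) -> dict[str, Any]:
    """Build a flat span-attribute dict using OpenInference-style key names."""
    attrs: dict[str, Any] = {
        "llm.model_name": model_name,
        "llm.token_count.prompt": prompt_tokens,
        "llm.token_count.completion": completion_tokens,
        "llm.token_count.total": prompt_tokens + completion_tokens,
    }
    # OTel attribute values must be primitives (or lists of primitives),
    # so each nested message is flattened into indexed keys.
    for prefix, messages in (
        ("llm.input_messages", input_messages),
        ("llm.output_messages", output_messages),
    ):
        for i, message in enumerate(messages):
            attrs[f"{prefix}.{i}.message.role"] = message["role"]
            attrs[f"{prefix}.{i}.message.content"] = message["content"]
    return attrs


attrs = flatten_llm_attributes(
    model_name="gpt-4o-mini",
    input_messages=[{"role": "user", "content": "Explain how transformers work."}],
    output_messages=[{"role": "assistant", "content": "Transformers use attention."}],
    prompt_tokens=12,
    completion_tokens=7,
)
```

Because every key is a plain string and every value is a primitive, any OTel backend can store these attributes, while a backend that understands the convention can rebuild the conversation view from them.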
Getting started with auto-instrumentation
Phoenix provides one-line instrumentation for major frameworks:
pip install arize-phoenix openinference-instrumentation-openai openinference-instrumentation-langchain
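import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor
from openinference.instrumentation.langchain import LangChainInstrumentor
# Start the Phoenix server (locally for development)
session = px.launch_app()
# Register a tracer provider that exports spans to the local Phoenix server
tracer_provider = register()
# Instrument your LLM frameworks
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)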
# Now all OpenAI and LangChain calls are traced automatically
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain how transformers work."}],
)
# The trace appears in Phoenix at http://localhost:6006
print(session.url)
The launch_app() call starts a local Phoenix server. All subsequent LLM calls are captured and available in the web UI at the printed URL. For production, you configure Phoenix as a remote endpoint:
from phoenix.otel import register
# Point to your self-hosted Phoenix server or cloud endpoint
tracer_provider = register(endpoint="http://your-phoenix-server:6006/v1/traces")
For LlamaIndex:
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
LlamaIndexInstrumentor().instrument()
For DSPy:
from openinference.instrumentation.dspy import DSPyInstrumentor
DSPyInstrumentor().instrument()
Each instrumentor patches the framework at the module level. Once you call instrument(), every call through that framework creates spans automatically.
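The patching idea can be shown with a toy example. The sketch below is not the OpenInference implementation: FakeClient, ToyInstrumentor, and RECORDED_SPANS are hypothetical names invented for illustration, and spans are stored in a plain list rather than exported. It only demonstrates the mechanism of replacing a method on the class so that every later call is recorded without changes at the call site.

```python
import functools
import time


class FakeClient:
    """Hypothetical stand-in for a framework class such as an LLM client."""

    def complete(self, prompt: str) -> str:
        return prompt.upper()


# In a real instrumentor, spans go to an exporter; here they go to a list.
RECORDED_SPANS: list[dict] = []


class ToyInstrumentor:
    """Wraps FakeClient.complete at class level and restores it on uninstrument()."""

    def __init__(self) -> None:
        self._original = None

    def instrument(self) -> None:
        if self._original is not None:
            return  # already instrumented, avoid double wrapping
        original = FakeClient.complete
        self._original = original

        @functools.wraps(original)
        def traced(client: FakeClient, prompt: str) -> str:
            start = time.perf_counter()
            try:
                return original(client, prompt)
            finally:
                # Record the span even if the wrapped call raises.
                RECORDED_SPANS.append(
                    {
                        "name": "FakeClient.complete",
                        "input.value": prompt,
                        "duration_s": time.perf_counter() - start,
                    }
                )

        FakeClient.complete = traced

    def uninstrument(self) -> None:
        if self._original is not None:
            FakeClient.complete = self._original
            self._original = None


instrumentor = ToyInstrumentor()
instrumentor.instrument()
result = FakeClient().complete("hello")      # traced
instrumentor.uninstrument()
untraced = FakeClient().complete("again")    # no span recorded
```

Patching the class rather than an instance is what makes the instrumentation apply to clients created anywhere in the program, including inside third-party code.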
Tracing multi-step agents
For agents with custom orchestration, you can create spans manually using the OTel API or Phoenix's wrapper:
from opentelemetry import trace
tracer = trace.get_tracer("my-agent")
def run_research_agent(question: str) -> str:
    with tracer.start_as_current_span("research-agent") as agent_span:
        agent_span.set_attribute("input.value", question)
        with tracer.start_as_current_span("document-retrieval") as retrieval_span:
            docs = retrieve_documents(question)
            retrieval_span.set_attribute("retrieval.documents", str(docs))
        with tracer.start_as_current_span("synthesis") as synthesis_span:
            answer = generate_answer(docs, question)
            synthesis_span.set_attribute("output.value", answer)
        agent_span.set_attribute("output.value", answer)
        return answer
This is standard OpenTelemetry code. The Phoenix-specific part is the OTLP exporter you configure, which points to Phoenix's ingestion endpoint. The advantage is that these same spans flow into your existing OTel infrastructure if you want to correlate LLM agent traces with traces from the rest of your distributed system.
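The exporter configuration mentioned above might look like the following sketch. It assumes the opentelemetry-sdk and opentelemetry-exporter-otlp-proto-http packages are installed and that a Phoenix server is listening on localhost:6006; the endpoint URL is an assumption you would replace with your own server address.

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Assumed address of a locally running Phoenix server.
PHOENIX_TRACES_ENDPOINT = "http://localhost:6006/v1/traces"

provider = TracerProvider()
# Batch spans in memory and send them to Phoenix over OTLP/HTTP.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint=PHOENIX_TRACES_ENDPOINT))
)
# Make this provider the global default so trace.get_tracer() uses it.
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("my-agent")
```

To send the same spans to a second backend, you can attach another span processor with its own exporter to the same provider, which is what makes correlation with the rest of your tracing stack possible.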
Embedding visualization for RAG debugging
One of Phoenix's genuinely distinctive features is its embedding visualization. When your traces include embedding vectors (from retrieval operations, for example), Phoenix renders them in a 2D projection using UMAP.
In a RAG system, this makes retrieval quality visible in a way that log inspection doesn't. You can see:
- Which user queries cluster together
- Which documents the retrieval step is returning for different query types
- Whether queries that produce bad answers cluster in a specific region of the embedding space
- How retrieval behavior changes after you update your embedding model or chunking strategy
This is particularly useful for debugging the "retrieval is returning the wrong documents" class of RAG failures. Looking at a trace log, you see the documents that were returned. Looking at the embedding visualization, you see why the retrieval system pulled those documents instead of the ones that would have actually helped.
It's not a feature you need day to day, but when retrieval quality is the core problem, nothing else gives you this visual handle on what's happening.
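The projection itself needs a library such as UMAP, but the quantity it helps you reason about is the similarity between query and document vectors. The sketch below is not Phoenix code and does not perform a UMAP projection; it is a plain cosine-similarity ranking over made-up three-dimensional vectors, with hypothetical document names, to show what "close in embedding space" means numerically.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(
    query: list[float], documents: dict[str, list[float]]
) -> list[tuple[str, float]]:
    """Return (document id, similarity) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up vectors for illustration; real embeddings have hundreds of dimensions.
documents = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.0],
    "api-reference": [0.0, 0.1, 0.9],
}
query = [0.8, 0.3, 0.0]
ranking = rank_documents(query, documents)
```

A 2D projection lets you see thousands of these relationships at once, which is why a cluster of bad answers far from any relevant document is easier to spot visually than in a table of scores.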
Dataset-based evaluation
Phoenix's evaluation workflow runs your application against a dataset and scores the outputs:
import phoenix as px
from phoenix.evals import (
    HallucinationEvaluator,
    QAEvaluator,
    run_evals,
)
from phoenix.evals.models import OpenAIModel
# Load traces from Phoenix into a dataframe
client = px.Client()
trace_df = client.get_spans_dataframe()
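# Run evaluations on the traces. These evaluators expect "input", "output",
# and "reference" columns, so derive them from the span columns first if needed.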
model = OpenAIModel(model="gpt-4o", api_key="your-key")
hallucination_evaluator = HallucinationEvaluator(model)
qa_evaluator = QAEvaluator(model)
results = run_evals(
    dataframe=trace_df,
    evaluators=[hallucination_evaluator, qa_evaluator],
    provide_explanation=True,
)
The evaluation runs against spans you've already captured, which is a useful pattern: you capture production traces for a period, pull them into a dataset, and run evaluations after the fact. This lets you assess quality on real-world inputs without running the evaluation in the critical path.
Phoenix ships pre-built evaluators for hallucination detection, Q&A correctness, relevance, toxicity, and summarization quality. Custom evaluators are straightforward Python functions or LLM-as-judge configurations.
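As a rough illustration of a code-based evaluator, the sketch below scores an output by how many of the reference's longer terms it repeats. The EvalResult shape, the function name, and the 0.5 threshold are all hypothetical choices for this example, not Phoenix's evaluator interface; the point is only that a custom check can be an ordinary function returning a label, a score, and an explanation.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    label: str
    score: float
    explanation: str


def _terms(text: str, min_length: int = 1) -> set[str]:
    """Lowercased words with trailing punctuation stripped."""
    return {
        word.lower().strip(".,")
        for word in text.split()
        if len(word.strip(".,")) >= min_length
    }


def reference_overlap_evaluator(
    output: str, reference: str, min_overlap: float = 0.5
) -> EvalResult:
    """Label an output 'grounded' if it repeats enough of the reference's terms."""
    # Ignore very short words so that "in", "by", and similar do not dominate.
    reference_terms = _terms(reference, min_length=4)
    if not reference_terms:
        return EvalResult("unknown", 0.0, "reference has no terms to compare")
    overlap = len(reference_terms & _terms(output)) / len(reference_terms)
    label = "grounded" if overlap >= min_overlap else "ungrounded"
    return EvalResult(label, overlap, f"{overlap:.0%} of reference terms found in output")


reference = "Phoenix stores data in SQLite by default."
good = reference_overlap_evaluator("By default Phoenix stores its data in SQLite.", reference)
bad = reference_overlap_evaluator("Phoenix requires a ClickHouse cluster.", reference)
```

A check like this is cheap enough to run on every captured span, which makes it a useful first filter before spending tokens on an LLM-as-judge pass.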
Self-hosting
Phoenix is straightforward to deploy in a containerized environment:
# Pull and run Phoenix
docker run -p 6006:6006 arizephoenix/phoenix:latest
That's the basic deployment. Phoenix stores data in SQLite by default, which works for development and small production workloads. For production, configure PostgreSQL:
docker run -p 6006:6006 \
  -e PHOENIX_SQL_DATABASE_URL=postgresql://user:password@host/dbname \
  arizephoenix/phoenix:latest
A Helm chart is available for Kubernetes deployments. The self-hosted infrastructure is simpler than Helicone's ClickHouse-based stack but slightly more involved than Langfuse's single-Postgres setup, mainly because Phoenix handles more data types (embeddings, spans, evaluations) in its schema.
Teams with data residency requirements get genuine isolation: all traces stay in your infrastructure, and there's no telemetry or call-home behavior in the open-source version.
Phoenix vs the alternatives
Phoenix vs Langfuse
This is the natural comparison, given that both are Apache/MIT licensed and self-hostable. Phoenix wins on OpenTelemetry integration depth and embedding visualization. Langfuse wins on prompt management maturity, simpler self-hosted infrastructure, and a more polished UI for non-technical stakeholders. Teams embedded in OTel infrastructure favor Phoenix. Teams that want simpler setup and better prompt management favor Langfuse. Both are strong; this is a preference call, not a clear winner.
Phoenix vs LangSmith
LangSmith is tighter with LangChain and has better production monitoring dashboards. Phoenix is self-hostable without an enterprise license and has OpenTelemetry-native tracing that works across any framework. Teams that can't send data to LangChain's cloud infrastructure and need observability should use Phoenix or Langfuse. Teams deeply invested in LangChain who can use cloud SaaS should use LangSmith.
Phoenix vs Helicone
Different abstraction levels. Helicone proxies at the HTTP layer and gives you API call visibility. Phoenix traces at the application layer and gives you orchestration visibility. For basic cost and usage monitoring, Helicone is faster to set up. For understanding agent behavior at the code level, Phoenix provides substantially more information.
Pricing and running costs
The self-hosted open-source deployment costs nothing beyond the infrastructure you run it on. A basic Phoenix server on a small cloud instance handles moderate trace volumes comfortably. For high-volume production workloads, sizing the PostgreSQL database is the main operational consideration.
Phoenix Cloud, the managed service, has a free tier with limited data retention. Paid cloud tiers start around $50/month. The commercial Arize AI platform, which adds enterprise ML monitoring features beyond LLM observability, is priced separately through sales.
For teams that can self-host, the total cost is infrastructure plus operational effort. For teams that need a managed service without enterprise LangSmith or Langfuse pricing, Phoenix Cloud's rates are competitive.
Who should use Arize Phoenix
Phoenix is a strong fit for:
Teams already running OpenTelemetry infrastructure. If your services emit OTel spans and you want LLM agent traces to live in the same observability stack, Phoenix's native OTel design makes the integration natural. Other tools require separate SDKs and separate dashboards.
Teams with data residency requirements that rule out cloud SaaS. Apache-2.0 self-hosting with no per-event fees or license restrictions is a hard requirement for many enterprises, and Phoenix delivers it as cleanly as Langfuse.
RAG application developers who need retrieval debugging tools. The embedding visualization is a genuine differentiator. If you're spending time debugging why your retrieval step returns the wrong documents, Phoenix gives you a visual diagnostic tool that other observability products don't provide.
Teams using DSPy, Haystack, or less common frameworks. Phoenix's OTLP endpoint accepts spans from any OTel-compatible source. For frameworks that aren't first-class citizens in the LangSmith or Langfuse ecosystem, Phoenix's standard-based approach often provides the cleanest integration path.
The verdict
Arize Phoenix is the right open-source observability tool for teams who take OpenTelemetry seriously. The OTel-native design, combined with the OpenInference semantic conventions, means LLM traces are first-class citizens in standard distributed tracing infrastructure rather than a parallel system that only talks to itself.
The practical tradeoffs are a slightly more involved self-hosted setup and a less polished UI than commercial alternatives. For teams that prioritize standards compliance, embedding visualization, and open-source sustainability backed by a commercial company, Phoenix is the tool that earns that preference.
If you're starting fresh and just want the simplest self-hosted setup with strong prompt management, Langfuse is slightly easier to get running. If you're serious about OTel integration or you need retrieval debugging tools, Phoenix is the better choice.
Key features
- OpenTelemetry-native tracing for LLM calls, agent steps, and retrieval spans
- One-line auto-instrumentation for LangChain, LlamaIndex, OpenAI, and DSPy
- Dataset-based evaluation with LLM-as-judge and code evaluators
- Embedding visualization for debugging retrieval quality in RAG systems
- Prompt template versioning and comparison
- Human annotation queue for labeling and feedback collection
- Real-time production monitoring with span-level filtering
- OTLP endpoint for ingesting traces from any OTel-compatible framework