Python Apache-2.0 observability evaluation open-source

Arize Phoenix

Open-source LLM observability and evaluation with native OpenTelemetry tracing and self-hosting

Arize Phoenix is an open-source LLM observability and evaluation platform from Arize AI. It uses OpenTelemetry as its tracing standard, auto-instruments major LLM frameworks, and runs dataset-based evaluations with LLM-as-judge or custom evaluators. It is available as a self-hosted open-source deployment or through Arize's managed cloud service, and it is a strong fit for teams already using OpenTelemetry infrastructure.

OpenTelemetry has become the standard for distributed tracing across the software industry. Your Kubernetes services emit OTel spans. Your databases emit OTel spans. Your message queues emit OTel spans. And then you add an LLM agent, and suddenly you're in a completely separate observability world with framework-specific callbacks and vendor-specific SDKs that don't connect to anything you already run.

Arize Phoenix is built on the premise that this shouldn't be true. LLM traces should be OpenTelemetry traces. The agent span that calls a tool should connect to the service span that served the response. The retrieval step should be observable the same way a database query is observable.

What Arize Phoenix is

Arize Phoenix is an open-source LLM observability and evaluation platform. It's built by Arize AI, a company that has been building ML observability tools since 2020. Phoenix is the open-source product; Arize's broader platform adds enterprise features like production ML monitoring, model explainability, and larger-scale data infrastructure.

The repository is at Arize-ai/phoenix, licensed under Apache-2.0, with approximately 6,800 stars as of late April 2026. Active development is backed by Arize AI's commercial interests, which is a meaningful difference from purely community-maintained open-source projects.

Phoenix is available as a self-hosted deployment and through Arize's managed cloud service. The self-hosted option is fully featured at no cost.

OpenTelemetry as a first principle

The standard OTel tracing model describes a trace as a tree of spans. Each span has a name, start time, duration, attributes (key-value metadata), and optional child spans. This model maps well to LLM workflows: one trace per user request, with spans for retrieval, reranking, LLM generation, and tool calls.
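To make the span model concrete, here is a minimal sketch of a trace as a tree of spans in plain Python. The `Span` class, its field names, and the example attribute keys are illustrative assumptions, not the OpenTelemetry SDK's own types.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Illustrative span: a named, timed unit of work with optional children."""
    name: str
    start_ms: int
    duration_ms: int
    attributes: dict = field(default_factory=dict)
    children: list["Span"] = field(default_factory=list)


def flatten(span: Span, depth: int = 0) -> list[tuple[int, str]]:
    """Walk the tree depth-first and return (depth, name) pairs."""
    rows = [(depth, span.name)]
    for child in span.children:
        rows.extend(flatten(child, depth + 1))
    return rows


# One trace per user request, with a child span for each workflow step.
trace_root = Span(
    name="user-request",
    start_ms=0,
    duration_ms=1200,
    children=[
        Span("retrieval", 0, 150, {"top_k": 5}),
        Span("reranking", 150, 50),
        Span("llm-generation", 200, 900, {"model": "gpt-4o-mini"}),
        Span("tool-call", 1100, 100),
    ],
)

for depth, name in flatten(trace_root):
    print("  " * depth + name)
```

A real tracing backend stores the same shape, with span and parent IDs in place of nested objects.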

The challenge is that standard OTel attributes don't know about LLM-specific concepts like prompt templates, token counts, or embedding vectors. Arize created OpenInference, a semantic convention layer that extends OTel with LLM-specific attribute names. When a LangChain call happens inside a Phoenix-traced workflow, the span includes:

llm.model_name with the model identifier
llm.token_count.prompt, llm.token_count.completion, llm.token_count.total
llm.input_messages and llm.output_messages with the full conversation
retrieval.documents with retrieved chunks and their scores
embedding.embeddings with the vector data if applicable

These conventions mean Phoenix can parse the LLM-specific data and display it meaningfully, rather than showing raw key-value spans that you have to interpret manually.
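As a rough illustration of how those attributes sit on a span, the sketch below builds a flat attribute dictionary using the names listed above. The helper `llm_span_attributes` is hypothetical, and JSON-encoding the message lists is a simplifying assumption; the OpenInference instrumentors set these attributes for you and may encode messages differently.

```python
import json


def llm_span_attributes(model: str, messages: list[dict], reply: str,
                        prompt_tokens: int, completion_tokens: int) -> dict:
    """Build span attributes using the OpenInference names listed above.

    Hypothetical helper: message lists are JSON-encoded here for simplicity.
    """
    return {
        "llm.model_name": model,
        "llm.token_count.prompt": prompt_tokens,
        "llm.token_count.completion": completion_tokens,
        "llm.token_count.total": prompt_tokens + completion_tokens,
        "llm.input_messages": json.dumps(messages),
        "llm.output_messages": json.dumps(
            [{"role": "assistant", "content": reply}]
        ),
    }


attrs = llm_span_attributes(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain how transformers work."}],
    reply="Transformers use self-attention to weigh tokens against each other.",
    prompt_tokens=12,
    completion_tokens=48,
)
print(attrs["llm.token_count.total"])
```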

Getting started with auto-instrumentation

Phoenix provides one-line instrumentation for major frameworks:

pip install arize-phoenix openinference-instrumentation-openai openinference-instrumentation-langchain

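import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor
from openinference.instrumentation.langchain import LangChainInstrumentor

# Start the Phoenix server (locally for development)
session = px.launch_app()

# Register a tracer provider that exports spans to the local Phoenix server
tracer_provider = register()

# Instrument your LLM frameworks
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)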

# Now all OpenAI and LangChain calls are traced automatically
from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain how transformers work."}]
)

# The trace appears in Phoenix at http://localhost:6006
print(session.url)

The launch_app() call starts a local Phoenix server. All subsequent LLM calls are captured and available in the web UI at the printed URL. For production, you configure Phoenix as a remote endpoint:

from phoenix.otel import register

# Point to your self-hosted Phoenix server or cloud endpoint
tracer_provider = register(endpoint="http://your-phoenix-server:6006/v1/traces")

For LlamaIndex:

from openinference.instrumentation.llama_index import LlamaIndexInstrumentor

LlamaIndexInstrumentor().instrument()

For DSPy:

from openinference.instrumentation.dspy import DSPyInstrumentor

DSPyInstrumentor().instrument()

Each instrumentor patches the framework at the module level. Once you call instrument(), every call through that framework creates spans automatically.
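The idea of patching at the module level can be sketched in a few lines of plain Python. This is an illustration of the general wrapping technique, not the OpenInference implementation; `fake_framework`, `instrument`, and `recorded_spans` are hypothetical names.

```python
import functools
import time
import types

# Stand-in for a framework module; a real instrumentor targets the
# framework's own classes and functions.
fake_framework = types.SimpleNamespace(
    complete=lambda prompt: f"echo: {prompt}"
)

recorded_spans: list[dict] = []


def instrument(module, func_name: str) -> None:
    """Replace module.func_name with a wrapper that records a span per call."""
    original = getattr(module, func_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return original(*args, **kwargs)
        finally:
            # Record the span even if the wrapped call raises.
            recorded_spans.append({
                "name": func_name,
                "duration_s": time.perf_counter() - start,
            })

    setattr(module, func_name, wrapper)


instrument(fake_framework, "complete")

# Calling code is unchanged, but every call now leaves a span behind.
result = fake_framework.complete("hello")
print(result, len(recorded_spans))
```

Because the patch replaces the function on the module itself, code that looks the function up after `instrument()` runs is traced without any changes of its own.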

Tracing multi-step agents

For agents with custom orchestration, you can create spans manually using the OTel API or Phoenix's wrapper:

from opentelemetry import trace

tracer = trace.get_tracer("my-agent")

def run_research_agent(question: str) -> str:
    with tracer.start_as_current_span("research-agent") as agent_span:
        agent_span.set_attribute("input.value", question)

        with tracer.start_as_current_span("document-retrieval") as retrieval_span:
            docs = retrieve_documents(question)
            retrieval_span.set_attribute("retrieval.documents", str(docs))

        with tracer.start_as_current_span("synthesis") as synthesis_span:
            answer = generate_answer(docs, question)
            synthesis_span.set_attribute("output.value", answer)

        agent_span.set_attribute("output.value", answer)
        return answer

This is standard OpenTelemetry code. The Phoenix-specific part is the OTLP exporter you configure, which points to Phoenix's ingestion endpoint. The advantage is that these same spans flow into your existing OTel infrastructure if you want to correlate LLM agent traces with traces from the rest of your distributed system.
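The exporter setup referred to above might look like the following sketch. It assumes the `opentelemetry-sdk` and `opentelemetry-exporter-otlp-proto-http` packages are installed and that Phoenix is listening at the local URL used earlier in this article; the function name `configure_phoenix_exporter` is hypothetical.

```python
def configure_phoenix_exporter(
    endpoint: str = "http://localhost:6006/v1/traces",
):
    """Install a global tracer provider that ships spans to Phoenix via OTLP/HTTP.

    The endpoint default is an assumption based on the local URL used in
    this article; change it for a remote deployment.
    """
    # Imported inside the function so this module still loads on machines
    # where the OpenTelemetry SDK is not installed.
    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.http.trace_exporter import (
        OTLPSpanExporter,
    )
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    provider = TracerProvider()
    # Batch spans in the background and send them to the OTLP endpoint.
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint=endpoint))
    )
    trace.set_tracer_provider(provider)
    return provider


# Usage (call once at startup, before creating tracers):
#   configure_phoenix_exporter()
#   tracer = trace.get_tracer("my-agent")
```

To send the same spans to a second backend, add another span processor with its own exporter to the same provider.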

Embedding visualization for RAG debugging

One of Phoenix's genuinely distinctive features is its embedding visualization. When your traces include embedding vectors (from retrieval operations, for example), Phoenix renders them in a 2D projection using UMAP.

In a RAG system, this makes retrieval quality visible in a way that log inspection doesn't. You can see:

Which user queries cluster together
Which documents the retrieval step is returning for different query types
Whether queries that produce bad answers cluster in a specific region of the embedding space
How retrieval behavior changes after you update your embedding model or chunking strategy

This is particularly useful for debugging the "retrieval is returning the wrong documents" class of RAG failures. Looking at a trace log, you see the documents that were returned. Looking at the embedding visualization, you see why the retrieval system pulled those documents instead of the ones that would have actually helped.

It's not a feature you need day to day, but when retrieval quality is the core problem, nothing else gives you this visual handle on what's happening.
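The projection is a visual tool, but you can spot-check the same question numerically. The sketch below ranks documents by cosine similarity to a query, assuming your retriever scores by cosine similarity (yours may use another metric). The three-dimensional vectors and document names are made up for illustration; real embeddings have hundreds of dimensions or more.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Toy embeddings: made-up values chosen to be easy to reason about.
query = [1.0, 0.0, 0.0]
documents = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.0, 1.0, 0.0],
    "returns-faq": [0.7, 0.0, 0.7],
}

# Rank document names from most to least similar to the query.
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query, documents[name]),
    reverse=True,
)
print(ranked)
```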

Dataset-based evaluation

Phoenix's evaluation workflow runs your application against a dataset and scores the outputs:

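import os

import phoenix as px
from phoenix.evals import (
    HallucinationEvaluator,
    QAEvaluator,
    run_evals,
)
from phoenix.evals.models import OpenAIModel
from phoenix.session.evaluation import get_qa_with_reference

# Load question/answer pairs and their retrieved context from Phoenix.
# The built-in evaluators expect `input`, `output`, and `reference` columns;
# the import path of this helper may differ between Phoenix versions.
client = px.Client()
qa_df = get_qa_with_reference(client)

# Run evaluations on the traces (the API key is read from the environment)
model = OpenAIModel(model="gpt-4o", api_key=os.environ["OPENAI_API_KEY"])
hallucination_evaluator = HallucinationEvaluator(model)
qa_evaluator = QAEvaluator(model)

# run_evals returns one dataframe per evaluator, in the order given
hallucination_results, qa_results = run_evals(
    dataframe=qa_df,
    evaluators=[hallucination_evaluator, qa_evaluator],
    provide_explanation=True,
)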

The evaluation runs against spans you've already captured, which is a useful pattern: you capture production traces for a period, pull them into a dataset, and run evaluations after the fact. This lets you assess quality on real-world inputs without running the evaluation in the critical path.

Phoenix ships pre-built evaluators for hallucination detection, Q&A correctness, relevance, toxicity, and summarization quality. Custom evaluators are straightforward Python functions or LLM-as-judge configurations.
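As a hedged sketch of the plain Python function style of custom evaluator, the function below labels an answer by how many of its words also appear in the reference text. The function name, the row keys (`input`, `output`, `reference`), the 0.5 threshold, and the returned `label`/`score` fields are assumptions for illustration; how you register such a function with Phoenix depends on the version you run.

```python
def grounded_in_reference(row: dict) -> dict:
    """Label an answer 'grounded' if at least half its words occur in the reference."""
    answer_words = set(row["output"].lower().split())
    reference_words = set(row["reference"].lower().split())
    if not answer_words:
        return {"label": "ungrounded", "score": 0.0}
    overlap = len(answer_words & reference_words) / len(answer_words)
    label = "grounded" if overlap >= 0.5 else "ungrounded"
    return {"label": label, "score": overlap}


# Made-up rows standing in for captured question/answer spans.
rows = [
    {"input": "What is the refund window?",
     "output": "refunds are accepted within 30 days",
     "reference": "refunds are accepted within 30 days of purchase"},
    {"input": "What is the refund window?",
     "output": "you can never get a refund",
     "reference": "refunds are accepted within 30 days of purchase"},
]

eval_results = [grounded_in_reference(row) for row in rows]
print([r["label"] for r in eval_results])
```

A word-overlap check like this is crude, but it is cheap and deterministic, which makes it a useful first pass before an LLM-as-judge evaluator.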

Self-hosting

Phoenix is straightforward to deploy in a containerized environment:

# Pull and run Phoenix
docker run -p 6006:6006 arizephoenix/phoenix:latest

That's the basic deployment. Phoenix stores data in SQLite by default, which works for development and small production workloads. For production, configure PostgreSQL:

docker run -p 6006:6006 \
  -e PHOENIX_SQL_DATABASE_URL=postgresql://user:password@host/dbname \
  arizephoenix/phoenix:latest
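One practical detail when filling in that URL: credentials containing characters such as `@` or `/` need percent-encoding, or the URL will be parsed incorrectly. The helper below is a hypothetical sketch that assumes Phoenix hands the value to a standard PostgreSQL URL parser.

```python
from urllib.parse import quote


def phoenix_database_url(user: str, password: str, host: str, dbname: str) -> str:
    """Build a PostgreSQL URL, percent-encoding the user and password."""
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return f"postgresql://{safe_user}:{safe_password}@{host}/{dbname}"


# Example values are made up; the result is what you would pass as
# PHOENIX_SQL_DATABASE_URL.
url = phoenix_database_url("phoenix", "p@ss/word", "db.internal", "phoenix")
print(url)
```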

A Helm chart is available for Kubernetes deployments. The self-hosted infrastructure is simpler than Helicone's ClickHouse-based stack but slightly more involved than Langfuse's single-Postgres setup, mainly because Phoenix handles more data types (embeddings, spans, evaluations) in its schema.

Teams with data residency requirements get genuine isolation: all traces stay in your infrastructure, and there's no telemetry or call-home behavior in the open-source version.

Phoenix vs the alternatives

Phoenix vs Langfuse

This is the natural comparison, since both are permissively licensed (Apache-2.0 for Phoenix, MIT for Langfuse) and self-hostable. Phoenix wins on OpenTelemetry integration depth and embedding visualization. Langfuse wins on prompt management maturity, simpler self-hosted infrastructure, and a more polished UI for non-technical stakeholders. Teams embedded in OTel infrastructure favor Phoenix. Teams that want simpler setup and better prompt management favor Langfuse. Both are strong; this is a preference call, not a clear winner.

Phoenix vs LangSmith

LangSmith is tighter with LangChain and has better production monitoring dashboards. Phoenix is self-hostable without an enterprise license and has OpenTelemetry-native tracing that works across any framework. Teams that can't send data to LangChain's cloud infrastructure and need observability should use Phoenix or Langfuse. Teams deeply invested in LangChain who can use cloud SaaS should use LangSmith.

Phoenix vs Helicone

Different abstraction levels. Helicone proxies at the HTTP layer and gives you API call visibility. Phoenix traces at the application layer and gives you orchestration visibility. For basic cost and usage monitoring, Helicone is faster to set up. For understanding agent behavior at the code level, Phoenix provides substantially more information.

Pricing and running costs

The self-hosted open-source deployment costs nothing beyond the infrastructure you run it on. A basic Phoenix server on a small cloud instance handles moderate trace volumes comfortably. For high-volume production workloads, sizing the PostgreSQL database is the main operational consideration.

Phoenix Cloud, the managed service, has a free tier with limited data retention. Paid cloud tiers start around $50/month. The commercial Arize AI platform, which adds enterprise ML monitoring features beyond LLM observability, is priced separately through sales.

For teams that can self-host, the total cost is infrastructure plus operational effort. For teams that need a managed service without enterprise LangSmith or Langfuse pricing, Phoenix Cloud's rates are competitive.

Who should use Arize Phoenix

Phoenix is a strong fit for:

Teams already running OpenTelemetry infrastructure. If your services emit OTel spans and you want LLM agent traces to live in the same observability stack, Phoenix's native OTel design makes the integration natural. Other tools require separate SDKs and separate dashboards.

Teams with data residency requirements that rule out cloud SaaS. Apache-2.0 self-hosting with no per-event fees or license restrictions is a hard requirement for many enterprises, and Phoenix delivers it as cleanly as Langfuse.

RAG application developers who need retrieval debugging tools. The embedding visualization is a genuine differentiator. If you're spending time debugging why your retrieval step returns the wrong documents, Phoenix gives you a visual diagnostic tool that other observability products don't provide.

Teams using DSPy, Haystack, or less common frameworks. Phoenix's OTLP endpoint accepts spans from any OTel-compatible source. For frameworks that aren't first-class citizens in the LangSmith or Langfuse ecosystem, Phoenix's standards-based approach often provides the cleanest integration path.

The verdict

Arize Phoenix is the right open-source observability tool for teams who take OpenTelemetry seriously. The OTel-native design, combined with the OpenInference semantic conventions, means LLM traces are first-class citizens in standard distributed tracing infrastructure rather than a parallel system that only talks to itself.

The practical tradeoffs are a slightly more involved self-hosted setup and a less polished UI than commercial alternatives. For teams that prioritize standards compliance, embedding visualization, and open-source sustainability backed by a commercial company, Phoenix is the tool that earns that preference.

If you're starting fresh and just want the simplest self-hosted setup with strong prompt management, Langfuse is slightly easier to get running. If you're serious about OTel integration or you need retrieval debugging tools, Phoenix is the better choice.

Key features

OpenTelemetry-native tracing for LLM calls, agent steps, and retrieval spans
One-line auto-instrumentation for LangChain, LlamaIndex, OpenAI, and DSPy
Dataset-based evaluation with LLM-as-judge and code evaluators
Embedding visualization for debugging retrieval quality in RAG systems
Prompt template versioning and comparison
Human annotation queue for labeling and feedback collection
Real-time production monitoring with span-level filtering
OTLP endpoint for ingesting traces from any OTel-compatible framework

Frequently Asked Questions

What is Arize Phoenix?

Arize Phoenix is an open-source observability and evaluation platform for LLM applications, built by Arize AI. It traces LLM calls, agent steps, and retrieval operations using OpenTelemetry as the underlying standard. Beyond tracing, it provides dataset-based evaluation with LLM-as-judge and custom evaluators, embedding visualization for RAG debugging, and prompt versioning. It's available as a self-hosted open-source deployment under Apache-2.0 or through Arize's managed cloud service.

How does Arize Phoenix differ from Langfuse?

Both are open-source and self-hostable, which makes them natural alternatives. The key differences are in architecture and depth. Phoenix is built around OpenTelemetry natively, which matters if you already run an OTel stack or need traces to integrate with existing infrastructure. Langfuse uses its own SDK with an OTel ingestion endpoint as a secondary path. Phoenix has stronger embedding visualization for RAG debugging. Langfuse has a more mature prompt management workflow and simpler self-hosted infrastructure (just Postgres). Both are strong choices for teams that need self-hosting.

Is Arize Phoenix actually free to self-host?

Yes. Phoenix is Apache-2.0 licensed and designed for self-hosting at no cost. You run the Phoenix server, connect your application using the `arize-phoenix` or `openinference` instrumentation libraries, and traces are stored locally. There are no per-event fees, no telemetry requirements, and no license restrictions on self-hosted use. The commercial Arize AI platform offers enterprise-grade production monitoring and support, but the open-source Phoenix covers most teams' observability needs.

What is OpenInference?

OpenInference is the semantic convention layer that Arize created to standardize how LLM-specific data is represented in OpenTelemetry spans. Standard OTel spans cover general tracing attributes like timing and service names. OpenInference adds conventions for LLM-specific fields: the prompt template, token counts, model name, embedding vectors, retrieval results, and tool call details. Phoenix uses OpenInference to give LLM traces a consistent structure that the UI can parse and display meaningfully. Frameworks like LlamaIndex and LangChain emit OpenInference-formatted spans once their OpenInference instrumentors are enabled.

Does Arize Phoenix work with frameworks other than LangChain?

Yes. Phoenix's auto-instrumentation covers LangChain, LlamaIndex, OpenAI SDK, Anthropic SDK, DSPy, Haystack, Mistral, Cohere, and AWS Bedrock. For frameworks without dedicated instrumentation, Phoenix accepts any OTLP-formatted traces through its standard OpenTelemetry endpoint. This means any application or framework that can emit OTel spans can send traces to Phoenix without a dedicated integration, which gives Phoenix broader compatibility than tools that rely on framework-specific SDK callbacks.