Open Source v0.1.2

See what your AI agents are actually doing.

The first MCP-native eval and observability tool. Log every trace, evaluate output quality, track costs across all your agents. Open-source core. One command to start.

npx @iris-eval/mcp-server

Works with any MCP-compatible agent

Claude Desktop, Cursor, Claude Code, Windsurf, LangChain, CrewAI, MCP SDK, AutoGen

Your monitoring has a blind spot.

Traditional APM sees HTTP status codes and latency. It has no idea your agent just leaked a credit card number, hallucinated an answer, or burned $0.47 on a single query.

What your APM sees
Status: 200 OK
Latency: 143ms
Memory: 245 MB
CPU: 12%
Throughput: 847 req/min
Health: All systems operational

vs

What Iris sees
PII Detected: SSN pattern in output (***-**-6789)
Injection Risk: Prompt manipulation attempt detected
Cost: $0.47 / query (4.7x over $0.10 threshold)
Hallucination Markers: "As an AI language model" in output
Tool call #3 error: database_lookup timed out (30s)
Quality Score: 0.32 / 1.0 (FAIL)

Three tools. Complete visibility.

Iris registers as an MCP server. Your agent discovers it and invokes its tools automatically. No SDK. No code changes.

Every execution. Every tool call. Every token.

log_trace captures full agent runs with hierarchical spans, per-tool-call latency, token usage, and cost in USD. Stored in SQLite, queryable instantly.

  • Hierarchical span tree with OpenTelemetry-compatible span kinds
  • Per-tool-call latency tracking
  • Token usage breakdown (prompt, completion, total)
  • Arbitrary metadata for custom attribution
Span Tree
AGENT  research-agent  2.3s
  LLM   system_prompt      0.1s
  TOOL  web_search         0.8s
  LLM   summarize_results  0.4s
  TOOL  database_query     0.3s
  LLM   final_response     0.7s
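
As a rough illustration of what a hierarchical trace looks like as data, the span tree above can be modeled as a small recursive type. This is only a sketch: the `Span` shape, its field names, and the helpers `leafTimeByKind` and `renderTree` are assumptions made for this example, not the actual `log_trace` schema.

```typescript
// Illustrative only: this Span shape is an assumption, not Iris's log_trace schema.
type SpanKind = "AGENT" | "LLM" | "TOOL";

interface Span {
  kind: SpanKind;
  name: string;
  durationMs: number;
  children: Span[];
}

// Small helper for spans that have no children.
const leafSpan = (kind: SpanKind, name: string, durationMs: number): Span => ({
  kind,
  name,
  durationMs,
  children: [],
});

// Mirrors the span tree shown above: one agent run with five child spans.
const researchRun: Span = {
  kind: "AGENT",
  name: "research-agent",
  durationMs: 2300,
  children: [
    leafSpan("LLM", "system_prompt", 100),
    leafSpan("TOOL", "web_search", 800),
    leafSpan("LLM", "summarize_results", 400),
    leafSpan("TOOL", "database_query", 300),
    leafSpan("LLM", "final_response", 700),
  ],
};

// Total time spent in leaf spans, grouped by span kind.
function leafTimeByKind(
  span: Span,
  totals: Record<SpanKind, number> = { AGENT: 0, LLM: 0, TOOL: 0 },
): Record<SpanKind, number> {
  if (span.children.length === 0) {
    totals[span.kind] += span.durationMs;
  } else {
    for (const child of span.children) leafTimeByKind(child, totals);
  }
  return totals;
}

// Render the tree as indented text, one line per span.
function renderTree(span: Span, depth = 0): string[] {
  const seconds = (span.durationMs / 1000).toFixed(1);
  const lines = [`${"  ".repeat(depth)}${span.kind} ${span.name} ${seconds}s`];
  for (const child of span.children) lines.push(...renderTree(child, depth + 1));
  return lines;
}

console.log(renderTree(researchRun).join("\n"));
console.log(leafTimeByKind(researchRun));
```

For the run shown above, the five child spans add up to the 2.3s total: 1,200 ms in LLM calls and 1,100 ms in tool calls.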

12 built-in rules across 4 categories.

evaluate_output scores quality across completeness, relevance, safety, and cost. Returns per-rule pass/fail with actionable suggestions. Add custom rules via Zod schemas.

  • PII detection: SSN, credit card, phone, email patterns
  • Prompt injection detection: 5 attack patterns
  • Hallucination markers and topic consistency
  • Custom rules with regex, keywords, JSON validation
Evaluation Results
SAFETY     PII Detection      PASS  1.0
SAFETY     Injection Check    PASS  1.0
RELEVANCE  Topic Consistency  PASS  0.87
COMPLETE   Output Coverage    WARN  0.62
COST       Budget Threshold   FAIL  0.0
Weighted Score: 0.71 / 1.0
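
To make the idea of heuristic rules and a weighted score concrete, here is a minimal sketch of how such an evaluator could work. Everything in it is hypothetical: the rule names, the weights, the 0.7 pass threshold, and the `scoreOutput` function are assumptions for illustration. It is written in plain TypeScript and does not reproduce the built-in rules or the Zod-based custom rule API.

```typescript
// Illustrative only: rule names, weights, and the 0.7 threshold are assumptions,
// and this is plain TypeScript rather than Iris's Zod-based custom rule API.
interface HeuristicRule {
  name: string;
  weight: number;
  // Returns a score between 0 (fail) and 1 (pass).
  score: (output: string) => number;
}

// Matches US Social Security numbers written as 123-45-6789.
const SSN_PATTERN = /\b\d{3}-\d{2}-\d{4}\b/;

const heuristicRules: HeuristicRule[] = [
  { name: "no_ssn", weight: 2, score: (o) => (SSN_PATTERN.test(o) ? 0 : 1) },
  {
    name: "no_ai_disclaimer",
    weight: 1,
    score: (o) => (o.toLowerCase().includes("as an ai language model") ? 0 : 1),
  },
  { name: "non_empty", weight: 1, score: (o) => (o.trim().length > 0 ? 1 : 0) },
];

interface Evaluation {
  score: number; // weighted average of rule scores, 0 to 1
  passed: boolean;
  failedRules: string[];
}

// Run every rule and combine the results into one weighted score.
function scoreOutput(output: string, threshold = 0.7): Evaluation {
  let weighted = 0;
  let totalWeight = 0;
  const failedRules: string[] = [];
  for (const rule of heuristicRules) {
    const s = rule.score(output);
    weighted += s * rule.weight;
    totalWeight += rule.weight;
    if (s < 1) failedRules.push(rule.name);
  }
  const score = weighted / totalWeight;
  return { score, passed: score >= threshold, failedRules };
}

console.log(scoreOutput("Your SSN is 123-45-6789."));
console.log(scoreOutput("The order shipped on Tuesday."));
```

Because each rule is a regex or string check, the result is deterministic: the same output always produces the same score. In this sketch the safety rule carries double weight, so a leaked SSN alone pulls the score to 0.5 and fails the evaluation.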

See what your agents actually cost you.

Aggregate cost across all agents over any time window. Not just per-trace cost — total spend visibility. Set budget thresholds and get flagged when agents overspend.

  • Per-execution cost in USD with token breakdown
  • Aggregate cost by agent, by time window
  • Budget threshold enforcement via eval rules
  • Token efficiency ratio monitoring
Cost Overview (Last 7 Days)
Total Spend: $127.43
Avg / Trace: $0.07
Over Budget: 23
research-agent: $91.74
code-review-bot: $22.91
support-agent: $12.78
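
The aggregation behind a view like this is simple to sketch. The example below assumes a hypothetical `TraceCost` record with one cost per trace. The field names, the sample data, and the `spendByAgent` and `overBudget` helpers are illustrative assumptions, not Iris's storage schema or query API.

```typescript
// Illustrative only: the TraceCost shape and sample data are assumptions,
// not Iris's storage schema.
interface TraceCost {
  agent: string;
  costUsd: number;
}

const sampleTraces: TraceCost[] = [
  { agent: "research-agent", costUsd: 0.47 },
  { agent: "research-agent", costUsd: 0.05 },
  { agent: "support-agent", costUsd: 0.03 },
];

// Total spend per agent, rounded to whole cents.
function spendByAgent(traces: TraceCost[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const t of traces) {
    totals[t.agent] = (totals[t.agent] ?? 0) + t.costUsd;
  }
  for (const agent of Object.keys(totals)) {
    totals[agent] = Math.round((totals[agent] ?? 0) * 100) / 100;
  }
  return totals;
}

// Traces that exceed the per-trace budget, with how many times over they are.
function overBudget(
  traces: TraceCost[],
  budgetUsd: number,
): { trace: TraceCost; timesOver: string }[] {
  return traces
    .filter((t) => t.costUsd > budgetUsd)
    .map((t) => ({
      trace: t,
      timesOver: `${(t.costUsd / budgetUsd).toFixed(1)}x`,
    }));
}

console.log(spendByAgent(sampleTraces));
console.log(overBudget(sampleTraces, 0.1));
```

With a $0.10 budget, only the $0.47 trace is flagged, at 4.7x over the threshold.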
3 MCP tools: log_trace, evaluate_output, get_traces
12 Built-in eval rules: Completeness, relevance, safety, cost
<1ms Eval latency: Heuristic rules. Fast and deterministic.
0 Lines of code to integrate: Add to MCP config. You're done.

60 seconds to first trace.

Add Iris to your Claude Desktop MCP config. It also works with Cursor and any other MCP-compatible agent.

claude_desktop_config.json
{
  "mcpServers": {
    "iris-eval": {
      "command": "npx",
      "args": ["@iris-eval/mcp-server"]
    }
  }
}
Terminal
$ npm install -g @iris-eval/mcp-server
$ iris-mcp --dashboard

Publications and insights.

Original research on MCP agent observability, evaluation methodology, and the evolving landscape of AI agent infrastructure.

I kept running into the same problem building AI agents: once they're running, you have no visibility into what they're actually doing. Traditional monitoring tells you the request succeeded. It can't tell you the agent leaked PII, hallucinated an answer, or burned through your budget on a single query.

So I built Iris — an MCP server that any agent discovers and uses automatically. No SDK. No code changes. Just add it to your config and start seeing everything.

Founder & Builder

Built in public. Shipping fast.

v0.1 (Released): Core MCP Server
3 tools, 12 eval rules, SQLite storage, web dashboard, production security

v0.2 (Planned): Cloud Tier
PostgreSQL, multi-tenancy, team dashboards, API key management

v0.3 (Planned): Alerting & Retention
Alert rules, webhooks, email notifications, retention policies

v0.4 (Planned): LLM-as-Judge
Semantic evaluation, OpenTelemetry export, drift detection, A/B testing

v0.5 (Planned): Enterprise
SSO/SAML, RBAC, audit logs, SOC 2 compliance

Team dashboards. Alerting. Managed infrastructure.

As your team grows, Iris grows with you. Get early access to the cloud tier.

No spam. We'll email when it's ready.