W&B Self-Improving Agents Hackathon
openclaw-trace
A recursive self-improvement loop for AI agents. Evidence-grounded. Measurable. Compounding.
Traced with W&B Weave
The Problem
AI agents make mistakes. They frustrate users. They miss opportunities.
But most agents can't learn from their own work.
What if they could?
The Key Insight
Every agent session is a training signal.
Errors, user friction, missed opportunities, moments of delight: they're all grounded evidence we can mine, cluster, and act on.
The Loop
Session Traces → Mine Signals → Rollup/Cluster → Tickets → Research Briefs → Experiments → Fixes
Then measure the delta. Repeat.
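The loop's first two stages can be sketched as composable functions. The function names and event schema here are illustrative assumptions, not the actual openclaw-trace internals:

```python
# Illustrative sketch of the loop's first two stages.
# Function names and the event schema are assumptions,
# not the actual openclaw-trace internals.

def mine_signals(session_events):
    """Extract grounded signals from a session trace."""
    signals = []
    for i, event in enumerate(session_events):
        if event.get("exit_code", 0) != 0:  # heuristic: non-zero exit = error
            signals.append({
                "kind": "error",
                "evidence": [{"event_i": i, "quote": event["output"]}],
            })
    return signals

def rollup(signals):
    """Group signals by kind (a stand-in for fingerprint clustering)."""
    clusters = {}
    for s in signals:
        clusters.setdefault(s["kind"], []).append(s)
    return clusters

# One pass over a toy two-event trace:
events = [
    {"output": "ok", "exit_code": 0},
    {"output": "Invalid value for deadline", "exit_code": 1},
]
clusters = rollup(mine_signals(events))
```

Each signal carries the index of the event it came from, so later stages can always point back to the raw trace.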
Architecture
Session traces (JSONL)
↓
mine-signals (LLM + heuristics)
error · user_frustration · proactive_opp · user_delight
↓
rollup (cluster + rank, fingerprints)
↓
research briefs (actor-critic: Claude drafts, Codex critiques, evidence-first)
↓
experiments + verified fixes (traced with Weave, state in Redis)
↓
measure Δ → recursive improvement loop
What We Mine
Errors
Tool failures, exceptions, non-zero exits. Exact quotes from traces.
User Friction
Frustration signals, confusion, repeated clarifications.
Improvement Suggestions
Ideas surfaced during sessions, explicit or implicit.
Proactive Opportunities
Things the agent could have done but didn't think to.
User Delight
Moments to create magic. Surprise and polish.
Experiment Ideas
Hypotheses worth testing. Ablations. Benchmarks.
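A minimal schema for these signal kinds might look like the following. The dataclass fields are an assumption modeled on the evidence format this deck shows (an event index plus an exact quote):

```python
from dataclasses import dataclass, field

# The six signal kinds named in this deck; the schema around
# them is an illustrative assumption.
SIGNAL_KINDS = {
    "error", "user_frustration", "improvement_suggestion",
    "proactive_opp", "user_delight", "experiment_idea",
}

@dataclass
class Evidence:
    event_i: int  # index into the session trace
    quote: str    # exact quote from that event

@dataclass
class Signal:
    kind: str
    summary: str
    evidence: list = field(default_factory=list)

    def __post_init__(self):
        if self.kind not in SIGNAL_KINDS:
            raise ValueError(f"unknown signal kind: {self.kind}")
```

Rejecting unknown kinds at construction time keeps downstream clustering from silently absorbing typos.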
Evidence-First
Every claim links back to exact quotes from real sessions.
No hallucinated improvements.
{
  "kind": "error",
  "summary": "Phorge deadline format rejected",
  "evidence": [{
    "event_i": 42,
    "quote": "Invalid value for deadline: expected epoch integer"
  }]
}
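A grounding check for this format is straightforward to sketch: every quote must appear verbatim in the trace event it cites. `is_grounded` and the event schema are illustrative, not the project's actual validator:

```python
def is_grounded(signal, session_events):
    """Accept a signal only if every evidence quote appears verbatim
    in the trace event it cites. Illustrative sketch, not the
    project's actual validator."""
    for ev in signal["evidence"]:
        i = ev["event_i"]
        if i >= len(session_events):
            return False  # cited event does not exist
        if ev["quote"] not in session_events[i].get("output", ""):
            return False  # quote is not verbatim in that event
    return True

signal = {
    "kind": "error",
    "summary": "Phorge deadline format rejected",
    "evidence": [{
        "event_i": 42,
        "quote": "Invalid value for deadline: expected epoch integer",
    }],
}
# Toy trace: 43 events, with the cited quote at index 42.
events = [{"output": ""}] * 42 + [
    {"output": "Error: Invalid value for deadline: expected epoch integer"},
]
```

Any signal that fails this check is dropped rather than reported, which is what keeps hallucinated improvements out.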
Actor-Critic Research Briefs
Evidence snapshot first: draft the facts before claims
Claude drafts each section with strict grounding
Codex critiques: flags unsupported claims
Claude revises: marks gaps as "Unknown"
Grounding check: a final pass to catch hallucinations
The result: briefs you can actually trust.
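The passes can be stubbed end to end. Here `draft`, `critique`, and `revise` are deterministic stand-ins for the real Claude (actor) and Codex (critic) calls:

```python
# Stubbed sketch of the actor-critic passes. draft/critique/revise
# stand in for the real LLM calls (Claude as actor, Codex as critic).

def draft(evidence):
    # Actor: one claim per evidence quote, plus one unsupported
    # claim to show the critic at work.
    return [e["quote"] for e in evidence] + ["users love the new flow"]

def critique(brief, evidence):
    # Critic: flag any claim not backed by an evidence quote.
    quotes = {e["quote"] for e in evidence}
    return [claim for claim in brief
            if claim not in quotes and claim != "Unknown"]

def revise(brief, issues):
    # Actor: mark flagged gaps as "Unknown" instead of asserting them.
    return [c if c not in issues else "Unknown" for c in brief]

def run_brief_pipeline(evidence):
    brief = draft(evidence)               # draft with strict grounding
    issues = critique(brief, evidence)    # flag unsupported claims
    if issues:
        brief = revise(brief, issues)     # mark gaps as "Unknown"
    assert not critique(brief, evidence)  # final grounding check
    return brief

# The evidence snapshot itself is the pipeline's input.
evidence = [{"quote": "Invalid value for deadline: expected epoch integer"}]
brief = run_brief_pipeline(evidence)
```

The unsupported claim survives only as an explicit "Unknown", so every remaining assertion in the brief traces back to a quote.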
Experiments: Weave + Redis
Weave: traces experiment runs, LLM calls, before/after metrics
Redis: caches experiment state and rollup data for fast retrieval
Fix verification: did it actually work?
Delta tracking: measurable improvement per fix
Full observability into experiment outcomes
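Fix verification with delta tracking can be sketched like this, using a plain dict as a stand-in for Redis (in the real pipeline the run would also be traced with W&B Weave, e.g. via the `@weave.op` decorator). Key layout and metric names are assumptions:

```python
import json

# Fix verification with delta tracking. A plain dict stands in for
# Redis here; in the real pipeline each run would also be traced
# with W&B Weave. Key layout and metric names are assumptions.

cache = {}  # stand-in for redis.Redis(); values stored as JSON strings

def record_metric(experiment_id, phase, value):
    cache[f"exp:{experiment_id}:{phase}"] = json.dumps(value)

def measure_delta(experiment_id):
    """Did the fix actually work? Positive delta = improvement."""
    before = json.loads(cache[f"exp:{experiment_id}:before"])
    after = json.loads(cache[f"exp:{experiment_id}:after"])
    return after - before

record_metric("fix-123", "before", 0.62)  # e.g. task success rate pre-fix
record_metric("fix-123", "after", 0.71)   # post-fix
delta = measure_delta("fix-123")
```

Storing both phases under a shared experiment key makes the before/after comparison a single lookup when it is time to decide whether a fix ships.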
What We've Built
Working CLI: openclaw-trace mine-signals
Clustering with fingerprints for idempotent ticket updates
Actor-critic research brief pipeline
Full Weave integration for tracing
The Vision
Many agents. Shared, sanitized rollups.
Evidence-backed improvements that propagate.
A distributed R&D engine, where the best fixes spread because they work.
openclaw-trace
Recursive self-improvement, grounded in evidence.
Traced with W&B Weave
github.com/phantastic-ai/openclaw-trace