W&B Self-Improving Agents Hackathon
openclaw-trace
A recursive self-improvement loop for AI agents. Evidence-grounded. Measurable. Compounding.
Traced with W&B Weave
The Problem
AI agents make mistakes. They frustrate users. They miss opportunities.
But most agents can't learn from their own work.
What if they could?
The Key Insight
Every agent session is a training signal.
Errors, user friction, missed opportunities, moments of delight: they're all grounded evidence we can mine, cluster, and act on.
The Loop
Session Traces → Mine Signals → Rollup/Cluster → Tickets → Research Briefs → Experiments → Fixes
Then measure the delta. Repeat.
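The loop's first two stages can be sketched as composable functions. The function names and event schema here are illustrative assumptions, not the actual openclaw-trace internals:

```python
# Illustrative sketch of the loop's first two stages.
# Function names and the event schema are assumptions,
# not the actual openclaw-trace internals.

def mine_signals(session_events):
    """Extract grounded signals from a session trace."""
    signals = []
    for i, event in enumerate(session_events):
        if event.get("exit_code", 0) != 0:  # heuristic: non-zero exit = error
            signals.append({
                "kind": "error",
                "evidence": [{"event_i": i, "quote": event["output"]}],
            })
    return signals

def rollup(signals):
    """Group signals by kind (a stand-in for fingerprint clustering)."""
    clusters = {}
    for s in signals:
        clusters.setdefault(s["kind"], []).append(s)
    return clusters

# One pass over a toy two-event trace:
events = [
    {"output": "ok", "exit_code": 0},
    {"output": "Invalid value for deadline", "exit_code": 1},
]
clusters = rollup(mine_signals(events))
```

Each signal carries the index of the event it came from, so later stages can always point back to the raw trace.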
Architecture
Session traces (JSONL)
↓
mine-signals (LLM + heuristics)
error · user_frustration · proactive_opp · user_delight
↓
rollup (cluster + rank, fingerprints)
↓
research briefs (actor-critic: Claude drafts, Codex critiques, evidence-first)
↓
experiments + verified fixes (traced with Weave, state in Redis)
↓
measure Δ → recursive improvement loop
What We Mine
Errors
Tool failures, exceptions, non-zero exits. Exact quotes from traces.
User Friction
Frustration signals, confusion, repeated clarifications.
Improvement Suggestions
Ideas surfaced during sessions, explicit or implicit.
Proactive Opportunities
Things the agent could have done but didn't think to.
User Delight
Moments to create magic. Surprise and polish.
Experiment Ideas
Hypotheses worth testing. Ablations. Benchmarks.
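A minimal schema for these signal kinds might look like the following. The dataclass fields are an assumption modeled on the evidence format this deck shows (an event index plus an exact quote):

```python
from dataclasses import dataclass, field

# The six signal kinds named in this deck; the schema around
# them is an illustrative assumption.
SIGNAL_KINDS = {
    "error", "user_frustration", "improvement_suggestion",
    "proactive_opp", "user_delight", "experiment_idea",
}

@dataclass
class Evidence:
    event_i: int  # index into the session trace
    quote: str    # exact quote from that event

@dataclass
class Signal:
    kind: str
    summary: str
    evidence: list = field(default_factory=list)

    def __post_init__(self):
        if self.kind not in SIGNAL_KINDS:
            raise ValueError(f"unknown signal kind: {self.kind}")
```

Rejecting unknown kinds at construction time keeps downstream clustering from silently absorbing typos.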
Evidence-First
Every claim links back to exact quotes from real sessions.
No hallucinated improvements.
{
  "kind": "error",
  "summary": "Phorge deadline format rejected",
  "evidence": [{
    "event_i": 42,
    "quote": "Invalid value for deadline: expected epoch integer"
  }]
}
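A grounding check for this format is straightforward to sketch: every quote must appear verbatim in the trace event it cites. `is_grounded` and the event schema are illustrative, not the project's actual validator:

```python
def is_grounded(signal, session_events):
    """Accept a signal only if every evidence quote appears verbatim
    in the trace event it cites. Illustrative sketch, not the
    project's actual validator."""
    for ev in signal["evidence"]:
        i = ev["event_i"]
        if i >= len(session_events):
            return False  # cited event does not exist
        if ev["quote"] not in session_events[i].get("output", ""):
            return False  # quote is not verbatim in that event
    return True

signal = {
    "kind": "error",
    "summary": "Phorge deadline format rejected",
    "evidence": [{
        "event_i": 42,
        "quote": "Invalid value for deadline: expected epoch integer",
    }],
}
# Toy trace: 43 events, with the cited quote at index 42.
events = [{"output": ""}] * 42 + [
    {"output": "Error: Invalid value for deadline: expected epoch integer"},
]
```

Any signal that fails this check is dropped rather than reported, which is what keeps hallucinated improvements out.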
Actor-Critic Research Briefs
Evidence snapshot first: draft the facts before claims
Claude drafts each section with strict grounding
Codex critiques: flags unsupported claims
Claude revises: marks gaps as "Unknown"
Grounding check: a final pass to catch hallucinations
The result: briefs you can actually trust.
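The passes can be stubbed end to end. Here `draft`, `critique`, and `revise` are deterministic stand-ins for the real Claude (actor) and Codex (critic) calls:

```python
# Stubbed sketch of the actor-critic passes. draft/critique/revise
# stand in for the real LLM calls (Claude as actor, Codex as critic).

def draft(evidence):
    # Actor: one claim per evidence quote, plus one unsupported
    # claim to show the critic at work.
    return [e["quote"] for e in evidence] + ["users love the new flow"]

def critique(brief, evidence):
    # Critic: flag any claim not backed by an evidence quote.
    quotes = {e["quote"] for e in evidence}
    return [claim for claim in brief
            if claim not in quotes and claim != "Unknown"]

def revise(brief, issues):
    # Actor: mark flagged gaps as "Unknown" instead of asserting them.
    return [c if c not in issues else "Unknown" for c in brief]

def run_brief_pipeline(evidence):
    brief = draft(evidence)               # draft with strict grounding
    issues = critique(brief, evidence)    # flag unsupported claims
    if issues:
        brief = revise(brief, issues)     # mark gaps as "Unknown"
    assert not critique(brief, evidence)  # final grounding check
    return brief

# The evidence snapshot itself is the pipeline's input.
evidence = [{"quote": "Invalid value for deadline: expected epoch integer"}]
brief = run_brief_pipeline(evidence)
```

The unsupported claim survives only as an explicit "Unknown", so every remaining assertion in the brief traces back to a quote.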
Experiments: Weave + Redis
Weave: traces experiment runs, LLM calls, before/after metrics
Redis: caches experiment state and rollup data for fast retrieval
Fix verification: did it actually work?
Delta tracking: measurable improvement per fix
Full observability into experiment outcomes
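Fix verification with delta tracking can be sketched like this, using a plain dict as a stand-in for Redis (in the real pipeline the run would also be traced with W&B Weave, e.g. via the `@weave.op` decorator). Key layout and metric names are assumptions:

```python
import json

# Fix verification with delta tracking. A plain dict stands in for
# Redis here; in the real pipeline each run would also be traced
# with W&B Weave. Key layout and metric names are assumptions.

cache = {}  # stand-in for redis.Redis(); values stored as JSON strings

def record_metric(experiment_id, phase, value):
    cache[f"exp:{experiment_id}:{phase}"] = json.dumps(value)

def measure_delta(experiment_id):
    """Did the fix actually work? Positive delta = improvement."""
    before = json.loads(cache[f"exp:{experiment_id}:before"])
    after = json.loads(cache[f"exp:{experiment_id}:after"])
    return after - before

record_metric("fix-123", "before", 0.62)  # e.g. task success rate pre-fix
record_metric("fix-123", "after", 0.71)   # post-fix
delta = measure_delta("fix-123")
```

Storing both phases under a shared experiment key makes the before/after comparison a single lookup when it is time to decide whether a fix ships.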
What We've Built
Working CLI: openclaw-trace mine-signals
Clustering with fingerprints for idempotent ticket updates
Actor-critic research brief pipeline
Full Weave integration for tracing
The Vision
Many agents. Shared, sanitized rollups.
Evidence-backed improvements that propagate.
A distributed R&D engine, where the best fixes spread because they work.
openclaw-trace
Recursive self-improvement, grounded in evidence.
Traced with W&B Weave
github.com/phantastic-ai/openclaw-trace