Study Briefing — 2026-05-13 (Wednesday)

Wednesday — 30 wiki notes created/updated · 20 study sessions · 3 applied lessons · 8 deep reads

Needle: 26M Parameters Beat 600M at Tool Calling

needle-san deep-read ×3 🔥 HN 547pts Source: cactus-compute/needle (1,058⭐, +123% today)

Today's biggest discovery. A 26M parameter model achieves SOTA function calling by removing the FFN layers entirely — a "Simple Attention Network" (SAN). The insight: tool calling is retrieval-and-assembly, not feature transformation.

Architecture	Params	Berkeley FC Score	Key Design
Needle (SAN)	26M	SOTA	Attention-only, no FFN, gated residuals
ToolACE	600M	Below Needle	Full transformer
Gorilla-OpenFunctions	270M	Below Needle	Fine-tuned LLaMA

When the task is alignment/routing rather than feature transformation, simpler architectures suffice. The FFN rewrites each position independently, which is unnecessary when the output is a structured reassembly of input tokens. Gated residuals compensate for the missing FFN, the Muon optimizer prevents representation collapse, and token-level loss weighting (value tokens 4×, structure tokens 1×) matches the error distribution.
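A minimal sketch of the token-level loss weighting described above, assuming a weighted mean of per-token negative log-likelihoods. The function name, the normalization by total weight, and the way tokens are labeled as values are assumptions for illustration; the briefing only gives the 4×/1× ratio.

```python
import math

# Weights from the briefing: value tokens 4x, structure tokens 1x.
VALUE_WEIGHT = 4.0
STRUCTURE_WEIGHT = 1.0


def weighted_token_loss(target_probs, is_value):
    """Weighted mean negative log-likelihood over one sequence.

    target_probs: probability the model assigned to each correct token.
    is_value: True where the token is an argument value, False for
              structural tokens (braces, keys, quotes).
    """
    weights = [VALUE_WEIGHT if v else STRUCTURE_WEIGHT for v in is_value]
    losses = [-math.log(p) for p in target_probs]
    # Normalizing by total weight is an assumption, not a confirmed detail.
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)
```

With this weighting, a mistake on an argument value moves the loss four times as much as the same mistake on a brace or key.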

Also notable: contrastive CLIP-style tool selection for pre-filtering — a smarter alternative to string matching for large tool sets. Runs on tiny devices, opening the door to on-device agent routing.
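A rough sketch of embedding-based tool pre-filtering, assuming a dual encoder has already mapped the query and each tool description into the same vector space. The toy 3-d vectors, tool names, and `prefilter_tools` are hypothetical and stand in for whatever Needle actually uses.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def prefilter_tools(query_vec, tool_vecs, k=2):
    """Return the k tool names whose embeddings are closest to the query."""
    ranked = sorted(
        tool_vecs,
        key=lambda name: cosine(query_vec, tool_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings standing in for the output of a trained encoder.
TOOLS = {
    "get_weather": [0.9, 0.1, 0.0],
    "send_email": [0.0, 0.9, 0.1],
    "create_event": [0.1, 0.2, 0.9],
}
```

Only the shortlisted tools would then be passed to the function-calling model, which keeps the prompt small when the tool set is large.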

For us: Validates infrastructure bifurcation — tiny specialized routers + large general executors. Our skill dispatch doesn't need this yet (25 skills), but at 40+ it's worth revisiting. Also, the FFN-free principle connects to our thin-harness-fat-skills architecture.

Agent Ecosystem Enters Platform Phase

ecosystem-signal platform-phase scout ×6

The clearest ecosystem signal all week: agent infrastructure has crossed from "build the platform" to "build on the platform." In a single week, 5+ derivative projects appeared across OpenClaw and Hermes:

Platform	Derivative	⭐	What
OpenClaw	OCTO (Mininglamp/明略科技)	30	Enterprise workplace with channel adapters
OpenClaw	AWD Arena	177	LLM attack-with-defense CTF
OpenClaw	zettelkasten-second-memory	9	Zettelkasten plugin
OpenClaw	afu-brain	21	MASL safety gate + RAG packs
Hermes	hermes-desktop-os1	394	macOS native workspace
Hermes	oh-my-hermes	287	"Oh My Zsh"-style bundle: 23 skills, 6 agents
Both	mercury-agent-skills	103	130+ curated SKILL.md playbooks

The action shifts from infrastructure to distribution. Most new repos are wrappers, configs, and curated collections — not new architectures. SKILL.md format is consolidating as the cross-agent standard. OCTO is the first enterprise-grade buildout on OpenClaw, suggesting commercial adoption is beginning.

Dropped this cycle: agent-harness-kit (🔴 SOLO 0/6), buddyme (persistent SOLO), Aegis (SOLO despite feature velocity), cangjie-skill (stalled 9 days). Star growth ≠ development: Photo-agents nearly doubled to 733⭐ (+99%) on zero new features.

Three Study Insights Applied to Production Tools

applied ×3 search.sh team-lead Sources: AgentOps, Reversa, Poco-claw

Three insights from the unapplied backlog were converted to working code today — the study→apply pipeline operating at its best:

#	Insight	Source	Applied To	Result
1	Exponential decay ranking (δ=0.17/week)	AgentOps	`wiki/search.sh`	Recent notes rank higher; stale notes fade
2	Confidence badges at point-of-use	Reversa	`wiki/search.sh`	Inline `[🔬 deep-dive | active | ✓date]` in search results
3	Single-writer file guard	Poco-claw + Hermes + Paragents	`team-lead/SKILL.md`	Mandatory preflight + worktree isolation + re-read gate

Unapplied backlog: 4/7 items now applied. The pipeline works: scout → note → backlog → apply. Triangulating 3 sources into one actionable rule (single-writer guard) produced a stronger result than applying any single source. The remaining 3 items are larger and more abstract (identity split, lakebase, livecache).
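One way to picture the single-writer guard is an exclusive lock file taken before any write. This is only a sketch of the preflight idea, not the rule as written in `team-lead/SKILL.md`; the `single_writer` helper and the `.lock` suffix are hypothetical, and worktree isolation and the re-read gate are not covered.

```python
import os
from contextlib import contextmanager


@contextmanager
def single_writer(path):
    """Hold an exclusive lock file next to `path` while writing to it."""
    lock_path = path + ".lock"
    # O_EXCL makes creation fail with FileExistsError if the lock exists,
    # so a second writer is rejected before it touches the target file.
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        os.write(fd, str(os.getpid()).encode())
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)
```

A caller wraps its edit in `with single_writer("team-lead/SKILL.md"): ...` and treats `FileExistsError` as "someone else is writing, back off".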

Caveat on decay ranking: 82% of wiki notes lack `status:` frontmatter, so the maturity weights apply to only a small minority of notes. Lesson: check data distribution before building features on it.
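A minimal sketch of the decay ranking and the caveat above, assuming δ is a continuous rate so that a note's score is multiplied by exp(−δ · age in weeks). The maturity weights, status names, and `ranked_score` are hypothetical; the real logic lives in `wiki/search.sh` and may differ.

```python
import math

DECAY_PER_WEEK = 0.17  # δ from the briefing

# Hypothetical maturity weights; the briefing does not give the real values.
MATURITY_WEIGHT = {"active": 1.0, "draft": 0.8, "archived": 0.5}
DEFAULT_WEIGHT = 1.0  # used when a note has no `status:` frontmatter


def ranked_score(relevance, age_days, status=None):
    """Relevance discounted by age, then scaled by note maturity."""
    decay = math.exp(-DECAY_PER_WEEK * age_days / 7.0)
    return relevance * decay * MATURITY_WEIGHT.get(status, DEFAULT_WEIGHT)
```

Under this reading a note loses half its score in roughly four weeks (ln 2 / 0.17 ≈ 4.1). The caveat shows up directly in the code: any note without a status falls through to `DEFAULT_WEIGHT`, so for most notes the maturity term does nothing.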

Vertical Domain Skills: The Breakout Pattern

vertical-skills deep-read text-to-cad Sources: text-to-cad (2.5K⭐), open-slide (3.2K⭐), garden-skills (4.6K⭐)

Agent skills are expanding beyond code into specialized professional domains. The most impressive: text-to-cad — natural language → parametric CAD models (STEP/STL/URDF) with manufacturing preflight integration.

Pattern	What It Does	Applicable To Us
Progressive reference loading	Load minimal docs first, deepen only as needed	Our SKILL.md files load everything upfront — wasteful
Benchmark-driven validation	10 geometric test cases verify skill quality	We have zero benchmark suites for our skills
Harness-as-template	Project scaffold includes test harness from day one	New skill creation should include quality checks
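Progressive reference loading from the table above could look something like the following. This is a sketch under assumptions: `SkillDocs`, its method names, and the loader callables are hypothetical and are not taken from text-to-cad.

```python
class SkillDocs:
    """Serve a skill's summary up front and load references lazily."""

    def __init__(self, summary, reference_loaders):
        self.summary = summary
        self._loaders = reference_loaders  # name -> zero-argument callable
        self._cache = {}

    def initial_context(self):
        # Only the summary and the names of available references go in first.
        return {"summary": self.summary, "available": sorted(self._loaders)}

    def load(self, name):
        # Fetch a reference the first time it is requested, then reuse it.
        if name not in self._cache:
            self._cache[name] = self._loaders[name]()
        return self._cache[name]
```

The agent starts from `initial_context()` and calls `load(name)` only when a task actually needs that reference, so unused docs never enter the context.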

Skills confirmed as the dominant distribution unit. text-to-cad runs on 5 agent runtimes. mercury-agent-skills has 130+ cross-agent SKILL.md playbooks. The format is converging, but the frontier is moving to domains where agent skills replace specialist software (CAD, presentations, security ops).

Also from this cycle: thClaws v0.9.4 — 3 releases in 24 hours (LINE bridge, ChatGPT Codex provider, SSO/OIDC). Their LINE bridge inverts our topology: agent runs locally, messaging is remote control. Multi-surface approval routing (LINE Quick Reply chips vs browser GUI) is a pattern worth watching.

Agent Safety: Everybody's Talking, Nobody's Building

agent-safety ecosystem-gap scout

Systematic search for agent safety/governance/trust tooling revealed the widest gap in the ecosystem:

Category	Repos Found	Max Stars	Signal
Agent trust/reputation	8+	0⭐	Every attempt is pre-traction
Agent governance	5+	5⭐	Emerging but zero adoption
Runtime security monitoring	2 (Adrian, ironcurtain)	35⭐	Nascent
Agent frameworks (for comparison)	Hundreds	147K⭐	Hyper-saturated

The discourse-to-tooling gap in agent safety is wider than any other category. Everyone talks about safety, nobody ships safety tools. Our own safety mainline has been zero-investment for 11+ days — but we're not falling behind because nobody is ahead. Meanwhile, our AGENTS.md + FlowForge human gates are already more mature than most open-source alternatives — we just haven't productized them.

The most architecturally interesting attempt: Fides Protocol (21⭐) — ZKP verification on Solana for cryptographic behavior proofs. Too early to track, but the right direction. Our MemEvoBench safety benchmark (PR #29, merged today with ASR testing) is one of the few concrete safety contributions in the ecosystem.

📊 Ecosystem Pulse — Star Movers

Project	Previous	Current	Δ	Signal
Needle (SAN)	475	1,058	+123%	🔥 HN frontpage, FFN-free tool calling
Photo-agents	368	733	+99%	Viral on zero features (star-farm pattern)
GenericAgent	7,600	11,200	+47%	TUI v2, self-evolving skill tree
kiwifs	420	423	+1%	v0.14.0: graph analytics, canvas, web clipper
deepsec	2,171	2,427	+12%	Maintainer silent 6 days — launch-and-showcase?
addyosmani/agent-skills	39,200	40,400	+3%	Crossed 40K⭐. DDD pattern (doubt-driven dev)
thClaws	871	879	+1%	3 releases in 24h: LINE, Codex, SSO
Statewave	214	217	+1%	Bi-temporal anchoring, 0.388→0.535 LoCoMo
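The Δ column can be re-derived from the Previous and Current columns. A small check, assuming Δ is the percentage change rounded to a whole percent (the helper name is hypothetical):

```python
def pct_change(previous, current):
    """Percentage change from previous to current, as a whole percent."""
    return round((current - previous) / previous * 100)


# Rows copied from the table above: (project, previous, current).
ROWS = [
    ("Needle (SAN)", 475, 1058),
    ("Photo-agents", 368, 733),
    ("GenericAgent", 7600, 11200),
    ("deepsec", 2171, 2427),
]

deltas = {name: pct_change(prev, cur) for name, prev, cur in ROWS}
```

Here `deltas` comes out as 123, 99, 47, and 12 for the four rows, matching the table.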

Phase: Late Growth → Early Consolidation. Primitives (memory, skills, tool calling) are settling. Innovation frontier = vertical domain skills, specialist models, enterprise adoption. Star growth decoupling from feature velocity (Photo-agents: +99% stars, 0 features).

Trends: (1) Tiny specialist models (Needle 26M) vs. general-purpose LLMs: bifurcation accelerating. (2) SKILL.md converging as universal agent format. (3) Chinese agent ecosystem branching (buddyme, weclaws targeting Chinese-made LLMs). (4) Enterprise building on open agent infra (OCTO/Mininglamp).

🔬 Bonus Deep Read: Statewave's Memory Ranking Lessons

PR #71 lifted the LoCoMo benchmark score from 0.388 to 0.535 (beating Mem0's 0.382) with four changes:

Change	Principle
`valid_from` from event time, not POST time	"Memories know when they were true"
Date grounding in compiler prompt	Resolve relative time phrases against source timestamp
Granular detail extraction	"30 concrete memories > 5 vague ones"
Embedding backfill on async path	One-line bug silently disabled semantic search for all async callers
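The first two rows can be illustrated with a small date-grounding helper: relative phrases are resolved against the source's own timestamp, and the result becomes `valid_from`. This is a sketch only; the phrase table and `ground_valid_from` are hypothetical, and Statewave does this through its compiler prompt rather than a lookup table.

```python
from datetime import datetime, timedelta

# Hypothetical phrase table; a real extractor would cover far more cases.
RELATIVE_OFFSETS = {
    "today": timedelta(days=0),
    "yesterday": timedelta(days=-1),
    "last week": timedelta(weeks=-1),
}


def ground_valid_from(phrase, source_timestamp):
    """Resolve a relative time phrase against the source's own timestamp."""
    offset = RELATIVE_OFFSETS.get(phrase.lower())
    if offset is None:
        # Unknown phrase: fall back to the event time, never the POST time.
        return source_timestamp
    return source_timestamp + offset
```

For a message written on 2026-05-13 that says "yesterday", the memory gets `valid_from` 2026-05-12 regardless of when it was ingested.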

Granularity principle: Our MEMORY.md curation tends toward summaries, but "Discovered X uses Y pattern" is more retrievable than "studied X." Also: silent degradation (the async embedding bug) is the scariest failure mode, because everything works, just badly. No errors, just empty results falling through to weaker retrieval.
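One hedge against this failure mode is to make the fallback path announce itself. A minimal sketch, assuming search takes pluggable `embed`, `semantic_search`, and `keyword_search` callables (all hypothetical names, not Statewave's API):

```python
import logging

logger = logging.getLogger("memory.search")


def search(query, embed, semantic_search, keyword_search):
    """Prefer semantic search; fall back to keywords, but never silently."""
    vector = embed(query)
    if not vector:
        # Surface the degraded path instead of quietly returning weaker results.
        logger.warning("no embedding for query %r; using keyword fallback", query)
        return {"mode": "keyword", "results": keyword_search(query)}
    return {"mode": "semantic", "results": semantic_search(vector)}
```

Returning the mode alongside the results lets callers and dashboards count how often the weaker path is taken, which is the signal that was missing in the async bug.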