Study Briefing — 2026-05-11 (Monday)

Monday — 14 wiki notes created/updated · 6+ study sessions · 1 applied lesson

RAG Is Not Memory — 8 Empirical Failure Modes

sentra-rag-failure-modes deep-read Source: niashwin/sentra-rag-failure-modes

A rigorous empirical study showing that cosine-similarity retrieval has structural failure modes: not edge cases, but geometric inevitabilities. Tested on Google's gemini-embedding-2, which has 3,072 dimensions but uses only 0.6–2.5% of them.

Failure Mode	What Breaks	Impact	Severity
F1 Negation	"safe" ≈ "not safe"	65% at cos ≥ 0.85	🔴 Critical
F2 Numeric	"$4.2M" ≈ "$42M"	90% indistinguishable	🔴 Critical
F6 Ebbinghaus	Old content buried	b = 0.467 (matches human forgetting)	🟡 Moderate
F8 Cross-modal	Text → image → audio	100% top-1 leaked	🔴 Critical
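One way to guard against the F2 collision in the table is to compare numbers exactly, outside the embedding space. The sketch below is illustrative only: `extract_amounts` and `amounts_agree` are hypothetical helper names, and the regex assumes simple dollar amounts with an optional K/M/B suffix and no thousands separators.

```python
import re
from decimal import Decimal

# Hypothetical suffix table; only the suffixes this sketch understands.
_SUFFIX = {
    "": Decimal(1),
    "K": Decimal(1_000),
    "M": Decimal(1_000_000),
    "B": Decimal(1_000_000_000),
}
# Assumption: amounts look like "$4.2M" or "$42"; "$1,000" is not handled.
_MONEY = re.compile(r"\$(\d+(?:\.\d+)?)([KMB]?)")


def extract_amounts(text: str) -> list[Decimal]:
    """Return every dollar amount found in `text` as an exact Decimal."""
    return [Decimal(num) * _SUFFIX[suffix] for num, suffix in _MONEY.findall(text)]


def amounts_agree(query: str, passage: str) -> bool:
    """True when every amount named in the query also appears in the passage."""
    found = set(extract_amounts(passage))
    return all(amount in found for amount in extract_amounts(query))
```

A retriever could run `amounts_agree` as a post-filter on candidates that scored high on cosine similarity, so that "$4.2M" and "$42M" are separated by value even when their embeddings are not.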

💡 Takeaway: Pure vector retrieval is a similarity oracle, not memory. Any production system needs hybrid retrieval (BM25 + dense), structured side-indexes for numbers and entities, and NLI verification for negation. This supports ClawMem's 5-stage pipeline and our own keyword + semantic approach.
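The takeaway does not say how the keyword and dense results should be merged. A common choice is reciprocal rank fusion (RRF), sketched below as one possible option and not as what ClawMem or the study uses. The function name and the example document ids are illustrative; `k = 60` is the constant conventionally used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores the sum of 1 / (k + rank) over every list
    it appears in; ranks start at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings from a BM25 index and a dense index.
bm25_hits = ["a", "b", "c"]
dense_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

RRF needs only ranks, not raw scores, so BM25 and cosine values never have to be put on a common scale.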

Three Memory Architectures, One Convergence

statewave clawmem paperguru-ccm deep-read ×3

Studied three independent agent memory systems in one day. They converge on the same architecture despite zero shared lineage:

System	Stars	Key Innovation	Design
Statewave	213⭐	Episode → Memory → Context pipeline	Postgres + pgvector, heuristic/LLM dual compiler
ClawMem	158⭐	Cross-agent shared vault	SQLite WAL, 5-stage retrieval, content-type half-lives
PaperGuru	70⭐	4 LAM axioms + temporal artifact graph	Two-surface routing (chunk heads → chunk contents)

💡 Convergent pattern: All three implement (1) append-only raw events, (2) compiled/distilled memories with typed confidence, (3) multi-signal scoring for retrieval, (4) decay curves differentiated by content type. This is the emerging "standard architecture" for agent memory. PaperGuru's 4 axioms — versioned content, multi-hop relevance, bounded query cost, provenance-grounded composition — serve as a useful completeness checklist.
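Points (3) and (4) of the convergent pattern can be sketched as a weighted score with a recency term that decays by content type. Everything concrete here is an assumption for illustration: the content types, the half-life values, the weights, and the function names are not taken from Statewave, ClawMem, or PaperGuru.

```python
# Hypothetical half-lives in days, one per content type.
HALF_LIFE_DAYS = {"decision": 180.0, "note": 30.0, "chat": 7.0}


def decay(age_days: float, half_life_days: float) -> float:
    """Exponential decay: 1.0 at age 0, 0.5 after one half-life."""
    return 0.5 ** (age_days / half_life_days)


def score(
    similarity: float,
    confidence: float,
    age_days: float,
    content_type: str,
    weights: tuple[float, float, float] = (0.6, 0.2, 0.2),
) -> float:
    """Combine similarity, confidence, and type-specific recency into one score."""
    w_sim, w_conf, w_rec = weights
    recency = decay(age_days, HALF_LIFE_DAYS[content_type])
    return w_sim * similarity + w_conf * confidence + w_rec * recency
```

With these illustrative numbers, a 30-day-old decision outranks a 30-day-old chat message of equal similarity and confidence, because the chat message has already passed several half-lives.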

CDLC: Context Has a Full Lifecycle, Not Just Generation

agentops agents-md-context-patterns deep-read Sources: boshu2/agentops (342⭐), Austin1serb/agents-md (82⭐)

AgentOps introduces CDLC (Context Development Lifecycle) — a 7-phase model treating context like code: Generate → Compile → Test → Distribute → Deliver → Observe → Adapt. Key claim: "Generation is one-seventh of the work."

Meanwhile, agents-md contributes one immediately actionable pattern:

COMMAND 2>&1 | head -c 4000: cap output by bytes instead of by lines. A line cap does not help when one huge line floods the entire context window. The author claims ~50% token reduction.

💡 Takeaway: We have a solid Generate/Compile/Observe loop (SOUL.md → wiki → daily-review), but weak Test and Adapt phases. AgentOps' /pre-mortem and /council concepts (multi-model councils to stress-test decisions before acting) are worth studying. The byte-cap pattern is an immediate win: our subagents currently use head -n, which caps lines but not bytes and so remains vulnerable to a single very long line.

Skill Evaluation Is the Next Ecosystem Layer

agent-skills-eval medusa scout Sources: agent-skills-eval (276⭐), Medusa (25⭐)

Two projects attacking the same question from different angles: "Does this skill actually help?"

agent-skills-eval (276⭐, 5 days old): A/B testing framework. Runs the same prompt with and without a SKILL.md, then has an LLM judge grade both outputs. This formalizes the "skill lift" measurement.

Medusa (25⭐): Static quality auditing via Rust CLI. 9-tier ranking system (Poor → Godlike) based on complexity/value/keyword scoring. Includes a "dreaming" feature that detects recurring quality gaps across scan sessions.
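The with/without comparison described for agent-skills-eval reduces to a paired difference in judge scores. The sketch below shows that arithmetic only. It is not the project's API: `skill_lift` is a hypothetical name, and the assumption is that each prompt yields one numeric judge score per arm.

```python
from statistics import mean


def skill_lift(with_skill: list[float], without_skill: list[float]) -> float:
    """Mean per-prompt difference in judge score (with skill minus without)."""
    if not with_skill or len(with_skill) != len(without_skill):
        raise ValueError("need one score per prompt in both arms")
    return mean(w - wo for w, wo in zip(with_skill, without_skill))


# Hypothetical judge scores for three prompts.
lift = skill_lift([8, 7, 9], [6, 7, 6])
```

A positive lift means the skill helped on average. With only a handful of prompts the number is noisy, so it is better read alongside the per-prompt differences than on its own.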

💡 Takeaway: The ecosystem is maturing from "ship skills" → "prove skills work." This is the quality layer that distribution (ClawHub, skillplus) needs. If ClawHub ever wants quality scores, agent-skills-eval's A/B approach is more defensible than Medusa's regex heuristics.

Applied: Lint-on-Write Pattern → Wiki Pre-Commit Hook

kiwifs applied wiki infrastructure

Studied kiwifs v0.10's lint-on-write architecture (inotify watcher → per-file AST validation → fix-or-reject). Applied the core insight to our wiki repo:

Built a pre-commit hook that validates wiki cards on every commit — checks for duplicate slugs, auto-updates last_verified timestamps on card edits, and catches structural issues before they enter git history.

Result: Wiki CI went from 🔴 red back to 🟢 green. The duplicate-slug bug that was causing lint failures is now caught at commit time instead of in CI.
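The duplicate-slug check at the heart of the hook could look roughly like this. It is a simplified sketch, not the hook as deployed: it assumes each card carries a `slug: value` line, and `find_duplicate_slugs` is a hypothetical name.

```python
import re
from collections import defaultdict

# Assumption: the slug sits on its own line as "slug: some-value".
_SLUG = re.compile(r"^slug:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def find_duplicate_slugs(cards: dict[str, str]) -> dict[str, list[str]]:
    """Map each slug used by more than one card to the paths that use it.

    `cards` maps a file path to that file's full text.
    """
    by_slug: dict[str, list[str]] = defaultdict(list)
    for path, text in cards.items():
        match = _SLUG.search(text)
        if match:
            by_slug[match.group(1)].append(path)
    return {slug: sorted(paths) for slug, paths in by_slug.items() if len(paths) > 1}
```

A pre-commit script would read the staged card files into `cards`, call this function, print any duplicates, and exit non-zero so the commit is rejected.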

✅ Applied learning: Study → identify pattern → implement locally → verify. This is the flywheel working as intended. The key principle from kiwifs: validate at the boundary where data enters the system, not after it's already committed.

Portfolio Pruned: 41 → 27 Tracked Projects

housekeeping

Aggressively pruned the study-tracking portfolio from 41 to 27 entries. Removed dead or dormant projects, consolidated duplicates, and dropped items below the relevance threshold. The target was ≤30, which is now met.