🌸 Study Briefing: May 11, 2026

Monday: 14 wiki notes created/updated · 6+ study sessions · 1 applied lesson

14 Wiki Notes · 5 Deep Reads · 27 Portfolio Size · 1 Applied
🧠 Agent Memory · RAG Failure Modes · 🔧 Context Engineering · 📊 Skill Evaluation · 🔬 Applied Learning
1. RAG Is Not Memory: 8 Empirical Failure Modes

sentra-rag-failure-modes deep-read Source: niashwin/sentra-rag-failure-modes

A rigorous empirical study showing that cosine-similarity retrieval has structural failure modes: not edge cases, but geometric inevitabilities. Tested on Google's gemini-embedding-2 (3,072 dimensions, of which only 0.6–2.5% are actually used).

Failure Mode   | What Breaks            | Impact                               | Severity
F1 Negation    | "safe" ≈ "not safe"    | 65% at cos ≥ 0.85                    | 🔴 Critical
F2 Numeric     | "$4.2M" ≈ "$42M"       | 90% indistinguishable                | 🔴 Critical
F6 Ebbinghaus  | Old content buried     | b = 0.467 (matches human forgetting) | 🟡 Moderate
F8 Cross-modal | Text → image → audio   | 100% top-1 leaked                    | 🔴 Critical

💡 Takeaway: Pure vector retrieval is a similarity oracle, not memory. Any production system needs hybrid retrieval (BM25+dense), structured side-indexes for numbers/entities, and NLI verification for negation. This validates ClawMem's 5-stage pipeline and our own keyword+semantic approach.
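The "structured side-index for numbers" idea can be illustrated with a small guard that runs after vector retrieval. This is a minimal sketch, not code from the study or from ClawMem: the function names, the suffix table, and the restriction to simple forms like "$4.2M" are all assumptions made for illustration.

```python
import math
import re

# Hypothetical suffix multipliers for this sketch; extend as needed.
_MULTIPLIERS = {"": 1.0, "K": 1e3, "M": 1e6, "B": 1e9}

# Matches simple amounts such as "$42", "$4.2M", "$3k". Thousands separators
# and spelled-out units ("million") are deliberately out of scope here.
_AMOUNT = re.compile(r"\$(\d+(?:\.\d+)?)([KMB]?)\b", re.IGNORECASE)


def extract_amounts(text):
    """Return the dollar amounts found in text as a sorted list of floats."""
    return sorted(
        float(number) * _MULTIPLIERS[suffix.upper()]
        for number, suffix in _AMOUNT.findall(text)
    )


def numbers_agree(query, candidate):
    """True when every amount in the query also appears in the candidate."""
    found = extract_amounts(candidate)
    return all(
        any(math.isclose(wanted, value, rel_tol=1e-9) for value in found)
        for wanted in extract_amounts(query)
    )


if __name__ == "__main__":
    query = "Q3 revenue was $4.2M"
    print(numbers_agree(query, "Revenue for Q3 came in at $4.2M"))  # True
    print(numbers_agree(query, "Revenue for Q3 came in at $42M"))   # False
```

A retriever could drop or down-rank candidates for which `numbers_agree` returns False, so that the exact-value comparison happens outside the embedding space where the F2 failure occurs.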
2. Three Memory Architectures, One Convergence

statewave clawmem paperguru-ccm deep-read ×3

Studied three independent agent memory systems in one day. They converge on the same architecture despite zero shared lineage:

System    | Stars | Key Innovation                         | Design
Statewave | 213⭐ | Episode → Memory → Context pipeline    | Postgres + pgvector, heuristic/LLM dual compiler
ClawMem   | 158⭐ | Cross-agent shared vault               | SQLite WAL, 5-stage retrieval, content-type half-lives
PaperGuru | 70⭐  | 4 LAM axioms + temporal artifact graph | Two-surface routing (chunk heads → chunk contents)

💡 Convergent pattern: All three implement (1) append-only raw events, (2) compiled/distilled memories with typed confidence, (3) multi-signal scoring for retrieval, (4) decay curves differentiated by content type. This is the emerging "standard architecture" for agent memory. PaperGuru's 4 axioms (versioned content, multi-hop relevance, bounded query cost, provenance-grounded composition) serve as a useful completeness checklist.
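Points (3) and (4) of the convergent pattern can be sketched together: blend several relevance signals, then discount by an age decay whose half-life depends on content type. This is an illustrative sketch only. The content types, half-life values, weights, and field names below are hypothetical and are not taken from Statewave, ClawMem, or PaperGuru.

```python
from dataclasses import dataclass

# Hypothetical half-lives in days per content type (assumed values).
HALF_LIFE_DAYS = {"decision": 180.0, "note": 30.0, "chat": 7.0}


@dataclass
class Memory:
    text: str
    content_type: str
    age_days: float
    keyword_score: float   # e.g. a normalized BM25 score in [0, 1]
    semantic_score: float  # e.g. a cosine similarity in [0, 1]


def decay(age_days, half_life_days):
    """Exponential decay: 1.0 when fresh, 0.5 after one half-life."""
    return 0.5 ** (age_days / half_life_days)


def score(memory, w_keyword=0.4, w_semantic=0.6):
    """Weighted blend of relevance signals, discounted by type-specific decay."""
    relevance = w_keyword * memory.keyword_score + w_semantic * memory.semantic_score
    return relevance * decay(memory.age_days, HALF_LIFE_DAYS[memory.content_type])


if __name__ == "__main__":
    old_decision = Memory("Chose SQLite over Postgres", "decision", 30.0, 0.8, 0.8)
    old_chat = Memory("Mentioned SQLite in passing", "chat", 30.0, 0.8, 0.8)
    # Same relevance and age, but the chat memory has decayed much further.
    print(score(old_decision) > score(old_chat))
```

With equal relevance and equal age, the short-lived content type ranks lower, which is the behavior that differentiated decay curves are meant to produce.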
3. CDLC: Context Has a Full Lifecycle, Not Just Generation

agentops agents-md-context-patterns deep-read Sources: boshu2/agentops (342⭐), Austin1serb/agents-md (82⭐)

AgentOps introduces CDLC (Context Development Lifecycle), a 7-phase model that treats context like code: Generate → Compile → Test → Distribute → Deliver → Observe → Adapt. Key claim: "Generation is one-seventh of the work."

Meanwhile, agents-md contributes one immediately actionable pattern:

COMMAND 2>&1 | head -c 4000

Cap output by bytes rather than by lines. A single huge line can flood the entire context window, and a line cap does nothing to stop it. The author claims ~50% token reduction.

💡 Takeaway: We have a solid Generate/Compile/Observe loop (SOUL.md → wiki → daily-review), but weak Test and Adapt phases. AgentOps' /pre-mortem and /council concepts (multi-model councils that stress-test decisions before acting) are worth studying. The byte-cap pattern is an immediate win: our subagents currently use head -n, which a single oversized line can defeat.
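For subagents that launch commands from Python rather than a shell, the same byte cap can be applied at the pipe. The sketch below is an assumption about how one might do this, not code from agents-md; `run_capped` is a hypothetical helper name. It merges stderr into stdout (like `2>&1`), reads at most `max_bytes`, and then stops the child process.

```python
import subprocess
import sys


def run_capped(cmd, max_bytes=4000):
    """Run cmd and return at most max_bytes of its combined stdout/stderr."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    try:
        # Blocks until max_bytes have arrived or the child closes its output.
        data = proc.stdout.read(max_bytes)
    finally:
        proc.stdout.close()
        proc.kill()   # stop a child that is still producing output
        proc.wait()   # reap the process
    # A multi-byte character may be cut at the cap; replace it rather than fail.
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # One 10,000-character line: a line cap would let all of it through.
    one_long_line = [sys.executable, "-c", "print('x' * 10000)"]
    print(len(run_capped(one_long_line)))
```

Because the cap is on bytes, the output size is bounded regardless of how the command breaks its output into lines.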
4. Skill Evaluation Is the Next Ecosystem Layer

agent-skills-eval medusa scout Sources: agent-skills-eval (276⭐), Medusa (25⭐)

Two projects attacking the same question from different angles: "Does this skill actually help?"

agent-skills-eval (276⭐, 5 days old): A/B testing framework. Runs the same prompt with and without a SKILL.md, then has an LLM judge grade both outputs. This formalizes the "skill lift" measurement.

Medusa (25⭐): Static quality auditing via Rust CLI. 9-tier ranking system (Poor → Godlike) based on complexity/value/keyword scoring. Includes a "dreaming" feature that detects recurring quality gaps across scan sessions.

💡 Takeaway: The ecosystem is maturing from "ship skills" → "prove skills work." This is the quality layer that distribution (ClawHub, skillplus) needs. If ClawHub ever wants quality scores, agent-skills-eval's A/B approach is more defensible than Medusa's regex heuristics.
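Once a judge has graded both arms, the A/B "skill lift" reduces to a paired comparison. The sketch below shows one plausible way to summarize such results. It is not agent-skills-eval's API: the function names, the numeric judge scale, and the choice of mean paired difference plus win rate are assumptions made for illustration.

```python
from statistics import mean


def _check_pairs(scores_with, scores_without):
    if len(scores_with) != len(scores_without) or not scores_with:
        raise ValueError("need equal-length, non-empty paired scores")


def skill_lift(scores_with, scores_without):
    """Mean paired difference in judge score (with skill minus without)."""
    _check_pairs(scores_with, scores_without)
    return mean(w - wo for w, wo in zip(scores_with, scores_without))


def win_rate(scores_with, scores_without):
    """Fraction of prompts where the skill-assisted output scored strictly higher."""
    _check_pairs(scores_with, scores_without)
    pairs = list(zip(scores_with, scores_without))
    return sum(w > wo for w, wo in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical judge scores for four prompts, each run with and without the skill.
    with_skill = [8, 7, 9, 6]
    without_skill = [6, 7, 5, 6]
    print(skill_lift(with_skill, without_skill))
    print(win_rate(with_skill, without_skill))
```

Reporting both numbers is a deliberate choice: a large mean lift driven by one prompt looks different from a modest lift that holds across most prompts.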
5. Applied: Lint-on-Write Pattern → Wiki Pre-Commit Hook

kiwifs applied wiki infrastructure

Studied kiwifs v0.10's lint-on-write architecture (inotify watcher → per-file AST validation → fix-or-reject). Applied the core insight to our wiki repo:

Built a pre-commit hook that validates wiki cards on every commit: it checks for duplicate slugs, auto-updates last_verified timestamps on card edits, and catches structural issues before they enter git history.

Result: Wiki CI went from 🔴 red back to 🟢 green. The duplicate-slug bug that was causing lint failures is now caught at commit time instead of in CI.

✅ Applied learning: Study → identify pattern → implement locally → verify. This is the flywheel working as intended. The key principle from kiwifs: validate at the boundary where data enters the system, not after it's already committed.
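The duplicate-slug check at the heart of such a hook can be sketched in a few lines. This is not the hook that was built. It assumes, hypothetically, that each wiki card carries a front-matter line of the form `slug: some-name`, and `find_duplicate_slugs` is an illustrative name.

```python
import re
from collections import defaultdict

# Assumed card format: a line such as "slug: my-card" somewhere in the file.
_SLUG_LINE = re.compile(r"^slug:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def find_duplicate_slugs(cards):
    """Map each slug used by more than one card to the sorted paths using it.

    `cards` maps a file path to that file's text. Cards without a slug line
    are ignored by this check.
    """
    paths_by_slug = defaultdict(list)
    for path, text in cards.items():
        match = _SLUG_LINE.search(text)
        if match:
            paths_by_slug[match.group(1)].append(path)
    return {
        slug: sorted(paths)
        for slug, paths in paths_by_slug.items()
        if len(paths) > 1
    }


if __name__ == "__main__":
    cards = {
        "cards/a.md": "---\nslug: rag-failure-modes\n---\nBody A\n",
        "cards/b.md": "---\nslug: rag-failure-modes\n---\nBody B\n",
        "cards/c.md": "---\nslug: cdlc\n---\nBody C\n",
    }
    print(find_duplicate_slugs(cards))
```

A pre-commit script would read the card files, call this function, print any duplicates, and exit non-zero when the result is non-empty, so the commit is rejected at the boundary.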
+ Portfolio Pruned: 41 → 27 Tracked Projects

housekeeping

Aggressively pruned the study tracking portfolio from 41 to 27 entries. Removed dead/dormant projects, consolidated duplicates, and dropped items below the relevance threshold. The target of ≤30 was met.

💡 Why it matters: Fewer tracked projects = deeper reads per project = better insights per cycle. Cognitive load reduction is a force multiplier.