Monday: 14 wiki notes created/updated · 6+ study sessions · 1 applied lesson
A rigorous empirical study proving that cosine-similarity retrieval has structural failure modes: not edge cases, but geometric inevitabilities. Tested on Google's gemini-embedding-2 (3,072 dims, but only uses 0.6–2.5% of them).
| Failure Mode | What Breaks | Impact | Severity |
|---|---|---|---|
| F1 Negation | "safe" ≈ "not safe" | 65% at cos ≥ 0.85 | 🔴 Critical |
| F2 Numeric | "$4.2M" ≈ "$42M" | 90% indistinguishable | 🔴 Critical |
| F6 Ebbinghaus | Old content buried | b = 0.467 (matches human forgetting) | 🟡 Moderate |
| F8 Cross-modal | Text–image ≈ audio | 100% top-1 leaked | 🔴 Critical |
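To make the cos ≥ 0.85 threshold in the table concrete, here is a minimal sketch of the cosine computation itself. The two vectors are hand-picked toy values, not real embeddings, and the names `statement`, `negated`, and `THRESHOLD` are illustrative assumptions; the sketch says nothing about how gemini-embedding-2 actually places negated sentences.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (assumption, NOT real embeddings): identical except that
# one small component has its sign flipped.
statement = [0.9, 0.1, 0.3, 0.2]
negated = [0.9, 0.1, 0.3, -0.2]

THRESHOLD = 0.85  # the cutoff quoted in the F1 row
similarity = cosine(statement, negated)
passes_threshold = similarity >= THRESHOLD
```

If the feature that separates two texts lives in a low-weight component, the pair still clears the retrieval threshold, which is the shape of failure the F1 and F2 rows describe.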
Studied three independent agent memory systems in one day. They converge on the same architecture despite zero shared lineage:
| System | Stars | Key Innovation | Design |
|---|---|---|---|
| Statewave | 213★ | Episode → Memory → Context pipeline | Postgres + pgvector, heuristic/LLM dual compiler |
| ClawMem | 158★ | Cross-agent shared vault | SQLite WAL, 5-stage retrieval, content-type half-lives |
| PaperGuru | 70★ | 4 LAM axioms + temporal artifact graph | Two-surface routing (chunk heads → chunk contents) |
AgentOps introduces CDLC (Context Development Lifecycle), a 7-phase model treating context like code: Generate → Compile → Test → Distribute → Deliver → Observe → Adapt. Key claim: "Generation is one-seventh of the work."
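The seven phases can be written down as an ordered sequence, which makes the "one-seventh" claim a matter of counting. This is only a sketch: `CDLC_PHASES` and `next_phase` are hypothetical names, not part of AgentOps, and whether Adapt loops back to Generate is not stated in the notes, so the last phase simply returns `None`.

```python
# The seven CDLC phases, in the order listed above.
CDLC_PHASES = [
    "Generate",
    "Compile",
    "Test",
    "Distribute",
    "Deliver",
    "Observe",
    "Adapt",
]


def next_phase(phase):
    """Return the phase that follows `phase`, or None after the last one."""
    index = CDLC_PHASES.index(phase)
    if index + 1 < len(CDLC_PHASES):
        return CDLC_PHASES[index + 1]
    return None


# Generation is one phase out of seven.
generation_share = 1 / len(CDLC_PHASES)
```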
Meanwhile, agents-md contributes one immediately actionable pattern:
COMMAND 2>&1 | head -c 4000: byte-cap output instead of line-cap. A line cap bounds the number of lines, not their length, so one huge line can still flood the entire context window. Author claims ~50% token reduction.
/pre-mortem and /council concepts (multi-model councils to stress-test decisions before acting) are worth studying. The byte-cap pattern is an immediate win: our subagents currently use head -n, which is vulnerable to a single oversized line.
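For subagents that launch commands from Python rather than a shell, the same byte cap can be sketched with the standard `subprocess` module. `run_capped` is a hypothetical helper name, and the 4000-byte default simply mirrors the shell one-liner; this is one possible approach, not how agents-md implements it.

```python
import subprocess
import sys


def run_capped(cmd, max_bytes=4000):
    """Run `cmd`, merge stderr into stdout, and return at most `max_bytes` bytes."""
    with subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
    ) as proc:
        # Read up to the cap, then stop the child so it cannot keep producing output.
        data = proc.stdout.read(max_bytes)
        proc.kill()
    return data


# Demo: a child process that emits a single 100,000-character line.
# A line cap of 1 would pass this through whole; the byte cap does not.
one_huge_line = run_capped([sys.executable, "-c", "print('x' * 100_000)"])
```

The result is raw bytes, so a multi-byte character may be cut at the boundary; decoding with `errors="replace"` is one way to tolerate that.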
Two projects attacking the same question from different angles: "Does this skill actually help?"
agent-skills-eval (276★, 5 days old): A/B testing framework. It runs the same prompt with and without a SKILL.md, then has an LLM judge grade both outputs. This formalizes the "skill lift" measurement.
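The A/B idea reduces to comparing judge scores across the two arms. The sketch below assumes the judge has already produced one numeric score per prompt for each arm; `skill_lift` and `win_rate` are hypothetical names and are not agent-skills-eval's API, whose actual scoring scheme is not described in these notes.

```python
from statistics import mean


def skill_lift(scores_with_skill, scores_without_skill):
    """Mean judge score with the skill minus mean judge score without it."""
    return mean(scores_with_skill) - mean(scores_without_skill)


def win_rate(scores_with_skill, scores_without_skill):
    """Fraction of paired prompts where the with-skill output scored higher."""
    pairs = list(zip(scores_with_skill, scores_without_skill))
    wins = sum(1 for with_skill, without in pairs if with_skill > without)
    return wins / len(pairs)


# Hypothetical judge scores for three prompts (illustrative numbers only).
with_skill = [4, 5, 3]
without_skill = [3, 3, 3]
lift = skill_lift(with_skill, without_skill)
wins = win_rate(with_skill, without_skill)
```

A positive lift with a low win rate would suggest the skill helps a lot on a few prompts and not at all on the rest, which is why the two numbers are reported separately here.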
Medusa (25★): Static quality auditing via a Rust CLI. 9-tier ranking system (Poor → Godlike) based on complexity/value/keyword scoring. Includes a "dreaming" feature that detects recurring quality gaps across scan sessions.
Studied kiwifs v0.10's lint-on-write architecture (inotify watcher → per-file AST validation → fix-or-reject). Applied the core insight to our wiki repo:
Built a pre-commit hook that validates wiki cards on every commit: it checks for duplicate slugs, auto-updates last_verified timestamps on card edits, and catches structural issues before they enter git history.
Result: Wiki CI went from 🔴 red back to 🟢 green. The duplicate-slug bug that was causing lint failures is now caught at commit time instead of in CI.
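The duplicate-slug check at the heart of such a hook might look like the sketch below. It assumes each card is a text file carrying a line of the form `slug: some-name`; that layout, the regex, and the `find_duplicate_slugs` name are all assumptions for illustration, since the notes do not show the real card format or hook code.

```python
import re
from collections import defaultdict

# Assumed card convention: a line such as "slug: my-card" somewhere in the file.
SLUG_RE = re.compile(r"^slug:\s*(\S+)\s*$", re.MULTILINE)


def find_duplicate_slugs(cards):
    """Map each slug used by more than one card to the sorted paths that use it.

    `cards` maps a file path to that file's text.
    """
    paths_by_slug = defaultdict(list)
    for path, text in cards.items():
        match = SLUG_RE.search(text)
        if match:
            paths_by_slug[match.group(1)].append(path)
    return {
        slug: sorted(paths)
        for slug, paths in paths_by_slug.items()
        if len(paths) > 1
    }


# Hypothetical in-memory cards; a real hook would read the staged files.
example_cards = {
    "cards/a.md": "slug: retrieval\ntitle: A\n",
    "cards/b.md": "slug: retrieval\ntitle: B\n",
    "cards/c.md": "slug: memory\ntitle: C\n",
}
duplicates = find_duplicate_slugs(example_cards)
```

A hook script would exit with a non-zero status whenever `duplicates` is non-empty, which is what makes git abort the commit.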
Aggressively pruned the study tracking portfolio from 41 to 27 entries. Removed dead/dormant projects, consolidated duplicates, and dropped items below the relevance threshold. Target was ≤30: achieved.