How we ran the eval
Methodology — nothing hidden.
Pemberton’s whole claim is that Claude answers better when it knows you. That’s testable. We tested it. Here’s exactly how.
The setup
A real user's wiki, built by running /wiki-build against that user's notes, past Claude sessions, and connected services, was frozen as the test fixture. Ten prompts were written against the contents of that wiki, each with an explicit expected-signal rubric: what a Claude that genuinely knew this person should mention, and what it should avoid mentioning.
The prompts span five categories: forward planning, identity questions, cross-reference recall, backward reference (“remind me why I…”), and reply drafting.
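For concreteness, here is roughly what one prompt-plus-rubric case could look like as data. The field names and example content below are illustrative guesses, not the exact fixture format shipped with Pemberton.

```python
# One eval case: a prompt plus the rubric the judge later scores against.
# Field names and wording are illustrative, not Pemberton's actual fixture schema.
EXAMPLE_CASE = {
    "id": "forward-planning-01",
    "category": "forward planning",   # one of the five categories above
    "prompt": "Plan my week.",
    "expected_signals": [             # what a Claude that knows this user should mention
        "names the user's active projects specifically",
        "accounts for known deadlines and commitments",
    ],
    "anti_signals": [                 # what it should avoid
        "asks the user to describe what they're working on",
        "invents projects not present in the wiki",
    ],
}
```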
The two runs
For each of the ten prompts, the model answered twice:
- Without Pemberton. Vanilla Claude. No wiki, no prior context, no memory.
- With Pemberton. The same model, with the Pemberton-built wiki block injected into the system prompt.
Both runs used Claude Sonnet 4.6 at the same temperature. The only variable was whether Pemberton’s context was present.
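A minimal sketch of that two-condition runner, assuming the Anthropic Python SDK. The model ID, token limit, and the exact way the wiki block is injected are placeholders; the real runner in eval/ may differ.

```python
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

MODEL = "claude-sonnet-4-6"      # placeholder ID for the Sonnet model used in both runs
TEMPERATURE = 0.0                # held constant across both conditions

def answer(prompt: str, wiki_block: str | None = None) -> str:
    """Run one prompt, optionally with the Pemberton wiki injected as the system prompt."""
    kwargs = {"system": wiki_block} if wiki_block else {}
    response = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        temperature=TEMPERATURE,
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )
    return response.content[0].text

def run_both(prompt: str, wiki_block: str) -> tuple[str, str]:
    """Condition A: vanilla Claude. Condition B: same model, wiki in the system prompt."""
    return answer(prompt), answer(prompt, wiki_block=wiki_block)
```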
The judge
Each prompt’s two responses were passed to Claude Opus as a blind pairwise judge. The judge saw the prompt, the rubric, and the two responses labeled only A and B. It scored each response against the rubric (0–3 points), justified its scores, then picked a winner.
The judge did not know which response came from which condition.
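A sketch of that judging step, under the same assumptions as the runner above. The judge prompt wording, the JSON output contract, and the Opus model ID are placeholders; randomizing which condition gets labeled A is one straightforward way to keep the judge blind, though the shipped scaffolding may handle it differently.

```python
import json
import random
import anthropic

client = anthropic.Anthropic()
JUDGE_MODEL = "claude-opus-4-1"   # placeholder ID for the Opus judge

def judge(prompt: str, rubric: str, resp_without: str, resp_with: str) -> dict:
    """Blind pairwise judging: the judge never learns which condition produced which response."""
    pair = [("without", resp_without), ("with", resp_with)]
    random.shuffle(pair)                        # randomize which condition is labeled A
    labels = {"A": pair[0][0], "B": pair[1][0]}

    judge_prompt = (
        "You are judging two responses to the same prompt.\n\n"
        f"PROMPT:\n{prompt}\n\n"
        f"RUBRIC:\n{rubric}\n\n"
        f"RESPONSE A:\n{pair[0][1]}\n\n"
        f"RESPONSE B:\n{pair[1][1]}\n\n"
        "Score each response against the rubric from 0 to 3, justify the scores, "
        "then pick a winner. Reply with bare JSON: "
        '{"score_a": int, "score_b": int, "justification": str, "winner": "A" or "B"}'
    )
    result = client.messages.create(
        model=JUDGE_MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": judge_prompt}],
    )
    verdict = json.loads(result.content[0].text)          # assumes the judge returns bare JSON
    verdict["winner_condition"] = labels[verdict["winner"]]  # unblind only after scoring
    return verdict
```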
The result
10 wins out of 10 prompts. Average margin: 3 to 0 on the rubric. Zero ties.
Pemberton-equipped Claude correctly recalled specific projects by name, surfaced unpushed work, declined a recruiter outreach for the right reason, named the killer chart in an investor deck, and proposed a writing topic grounded in what the user had already published.
Unaided Claude, in every case, responded to “plan my week” or “polish this slide” with some version of “could you tell me what you’re working on?”
Why we trust this
The wiki used in the evaluation was real — a user’s actual notes, projects, and people, not a synthetic fixture. The prompts were written against the real content, not against generic personas. The rubric was set before the runs, not adjusted after seeing results. The judge was a separate, more capable model, scoring blind.
It’s a small N. Ten prompts isn’t a benchmark; it’s a sanity check. The harder version of this evaluation — does Pemberton’s auto-built wiki match a hand-crafted one written by an engineer who already knows the user? — is the bar for v1.1.
See it for yourself
The whole harness is pre-built so you can run it against your own wiki after install. The eval/ directory in the Pemberton tarball ships the prompts, the runner, and the judge scaffolding. Build your wiki, write your own rubric, run the script. Send us the results.