wrong. how it works

methodology · sources · limits

How wrong. actually works.

A record of what each AI critic sees, how the wrongness score is computed, where the limits are, and what you should verify yourself before trusting any of it.

What each critic does

When you submit an argument, the server fans the same text out to three independent critics in parallel. The critics never see each other's outputs; each critiques the text in isolation, which is why they sometimes disagree, and why the disagreement is worth surfacing. A sketch of the fan-out follows the critic list below.

  • Evidence. Stress-tests the factual base: sourcing, recency, accuracy, cherry-picks, fabricated citations, "studies show" with no studies named, numeric claims that don't check out.
  • Reasoning. Tests the warrant chain: whether premises actually entail the conclusion, smuggled causation, available link turns and impact turns, galaxy-brain pacing, double-binds.
  • Framing. Checks structural integrity: falsifiability, scope vs. evidence, motte-and-bailey, definitional drift, loaded framing, missing steelman.
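
For concreteness, a minimal TypeScript sketch of that fan-out. The critic names come from the list above; runCritic and everything inside it are assumptions, not the site's actual code.

```typescript
// Hypothetical sketch: dispatch the same argument to all three critics at once.
// `runCritic` is an assumed helper that wraps one model call behind that critic's prompt.
type CriticName = "evidence" | "reasoning" | "framing";

interface CriticResult {
  critic: CriticName;
  report: unknown; // the critic's JSON output, sketched further down the page
}

async function runCritic(critic: CriticName, argument: string): Promise<CriticResult> {
  // One independent model call per critic; no critic sees another critic's output.
  // Placeholder body; the real call goes to Claude with a critic-specific prompt.
  return { critic, report: { argument } };
}

async function fanOut(argument: string): Promise<CriticResult[]> {
  const critics: CriticName[] = ["evidence", "reasoning", "framing"];
  // Promise.all runs the three critiques concurrently; results keep list order.
  return Promise.all(critics.map((critic) => runCritic(critic, argument)));
}
```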

What the critics can do

The critics don't just read your argument from cold memory. They have working tools and use them.

  • Web search. The Evidence critic searches the open web to confirm that cited papers exist, that named authors wrote what's quoted, and that statistics trace back to a real primary source. If a citation can't be found — or can be found but says something different from what your argument claims it says — that's flagged as a finding with the verified URL in the fix.
  • Web fetch. Any URL you cite gets pulled directly. The critic reads the actual page and checks whether your claim is supported by the source's content, not just its title.
  • Arithmetic. Every numeric claim in your argument gets recomputed: percentages, totals, rates of change, compounded probabilities. "30% of 4M = 1.2M" gets checked. If the math is off, the load-bearing claim collapses, and the critic flags it with the correct number. (A sketch of this check follows the list.)
  • Definition checks. The Framing critic searches for canonical definitions of contested or technical terms when it suspects the argument is shifting meaning between premise and conclusion.
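
A hedged sketch of the kind of recomputation the arithmetic check performs, using the "30% of 4M" example above. The NumericClaim shape, the tolerance, and the function name are assumptions; in practice the recheck is done by the critic model itself.

```typescript
// Hypothetical sketch: recompute an "X% of Y = Z" claim and compare it to the stated figure.
interface NumericClaim {
  percent: number; // e.g. 30
  base: number;    // e.g. 4_000_000
  stated: number;  // e.g. 1_200_000
}

function checkPercentageClaim(claim: NumericClaim, tolerance = 0.005): string {
  const recomputed = (claim.percent / 100) * claim.base;
  const relativeError = Math.abs(recomputed - claim.stated) / recomputed;
  return relativeError <= tolerance
    ? "checks out"
    : `flag: stated ${claim.stated}, recomputed ${recomputed}`;
}

// "30% of 4M = 1.2M" -> "checks out"
console.log(checkPercentageClaim({ percent: 30, base: 4_000_000, stated: 1_200_000 }));
```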

Verification is bounded. The critics search 2-5 times each, not 50. They prioritize the load-bearing claims — the ones the argument actually rests on — over peripheral details. A clean web search is good signal but not a guarantee; the model can still misread a verified source.

How the score is computed

Each critic returns a JSON object: a sub-score, a one-line headline, and a list of findings. Each finding has a severity (1-10), the issue, an evidence quote from the argument, and a fix.
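
Sketched as a TypeScript shape, assuming field names that match the description above (the site's actual schema may differ):

```typescript
// Hypothetical sketch of one critic's JSON output, as described above.
interface Finding {
  severity: number;  // 1-10
  issue: string;     // what is wrong
  evidence: string;  // quote from the submitted argument
  fix: string;       // how to repair it (may include a verified URL)
}

interface CriticReport {
  subScore: number;    // this critic's own score
  headline: string;    // one-line summary
  findings: Finding[]; // zero or more flagged problems
}
```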

The Synthesizer then reads all three critic outputs plus the original argument and does four things: dedupes overlapping flaws across critics, ranks the survivors by severity and load-bearing weight, calibrates the score against a five-band rubric of worked anchor examples, and emits a single 0-100 wrongness number plus the kill sentence.

The Synthesizer does not generate new findings. It only weighs, dedupes, ranks, and calibrates what the three critics already produced. If a critique appears in the verdict, at least one critic flagged it.
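
A matching sketch of what the synthesized verdict could look like, again with assumed field names; the consolidation itself is performed by the model, not by deterministic code.

```typescript
// Hypothetical sketch of the synthesized verdict (Finding as sketched above).
interface Verdict {
  wrongness: number;      // 0-100, calibrated against the band rubric below
  killSentence: string;   // the kill sentence the verdict leads with
  topFindings: Finding[]; // deduped, ranked survivors from the three critic reports
}
```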

band rubric

0–14 · YOU'RE RIGHT · argument is tight, sources current, warrants explicit, scope matches evidence

15–34 · mostly holds · survives, with minor gaps a careful reader would pick at

35–54 · uneven · core claim may survive, but multiple steps are under-warranted

55–74 · structurally weak · load-bearing flaws; conclusion not entailed by what's offered

75–100 · WRONG · the argument fails on its own terms — fabricated, unfalsifiable, or self-contradicting
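
The rubric reduces to a straightforward band lookup. A minimal sketch using the thresholds above; the function name is an assumption, the labels are the rubric's own.

```typescript
// Map a 0-100 wrongness score onto the five bands above.
function scoreToBand(score: number): string {
  if (score <= 14) return "YOU'RE RIGHT";
  if (score <= 34) return "mostly holds";
  if (score <= 54) return "uneven";
  if (score <= 74) return "structurally weak";
  return "WRONG";
}
```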

The model

All three critics and the Synthesizer run against Claude, Anthropic's frontier general-purpose model. The site uses Claude and only Claude — no fine-tuning, no other models, no fallback chain. The single best thing about this design is that as Claude gets better, this tool gets better, automatically. Every Claude release shows up here.
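
In practice that means one client, one model, everywhere. A hedged sketch with the official TypeScript SDK; the model string is a placeholder and the prompt wiring is assumed — only the @anthropic-ai/sdk client and its messages.create call are real.

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// One critic call: the same Claude model serves every critic and the Synthesizer,
// with no fine-tuning and no fallback chain.
async function critique(systemPrompt: string, argument: string) {
  const response = await client.messages.create({
    model: "claude-sonnet-4-5", // placeholder; swapped as new Claude releases ship
    max_tokens: 2048,
    system: systemPrompt,
    messages: [{ role: "user", content: argument }],
  });
  return response.content;
}
```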

What it still can't do

Even with web verification and arithmetic, this is the part of the page to read before you trust any individual finding.

  • Verification isn't exhaustive. The critics search 2-5 times per argument and pick the most load-bearing claims to check. Peripheral facts may pass through unverified. A claim that goes unflagged in the report isn't proven true — the critics may simply not have focused on it.
  • Tool failures pass silently. If a search returns nothing, a URL is paywalled, or the source has been edited since you cited it, the critic can't verify the claim and may not clearly say so. Unverified ≠ wrong, but it also ≠ verified.
  • The model can still misread a verified source. Just because Evidence pulled the actual page doesn't mean it interpreted it correctly. Treat counter-facts the critics produce as prompts to look closer, not as final answers.
  • Prompt-bound reasoning. Each critic only sees the argument you submitted. Context you have but didn't include — the audience, the prior conversation, why a claim is load-bearing in your situation — is invisible. Sometimes a finding is correct in the abstract but irrelevant in your case.
  • Calibration drift. The band rubric is anchored on five worked examples. Very short arguments, very technical arguments, or arguments dressed in unusual rhetoric can land at scores that don't match how a human expert would grade them.

How to verify what you got

Every result page has an Everything else drawer at the bottom. It contains the full per-critic output: every finding each critic raised, with the evidence quote and the fix, before the Synthesizer collapsed them. If a finding looks wrong, open the drawer and check whether it actually showed up in raw critic output or appeared only in the deduped top-3.

If a critic cites a fact, search for it. If a critic claims a source doesn't exist, check whether it does. The score is a starting point for a closer read, not a verdict you should hand to anyone.

Why three critics and a synthesizer

One AI grading an argument gives you one read. Three independent critics with different jobs sometimes converge on the same flaw — which is signal — and sometimes split, which is also signal. The "the critics disagreed" callout on the result page is a feature, not a bug. Disagreement is where you find the real tension in your argument.

The Synthesizer exists to keep the verdict honest: if two critics raised overlapping but slightly different versions of the same flaw, that flaw should not be counted twice. If one critic was clearly off-base, the others should outvote it.

What changes from here

The band rubric is calibrated, not absolute. As more arguments come through and more failure modes are documented, the worked anchors will get re-tuned. The prompts will tighten. The model will be swapped as new Claude releases ship. This page will be updated when those things change. The source repo is the canonical record.