We Tried Training Models to Verify Themselves. It Made Them Worse.
The obvious idea: train a model to be its own verifier. Feed it correct and incorrect code pairs. Let it learn what “verified” looks like. Skip the external verification step.
We ran the experiment. It does not work.
The setup
StarCoder2-3B as the base model. DPO (Direct Preference Optimization) with curated verification pairs — code that passes verification vs. code that fails. We scaled the training data from 120 pairs to 2,000.
Total cost: $172.
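For readers who want the shape of the data: DPO expects (prompt, chosen, rejected) records, and in this setup "chosen" means code that passed an external verifier while "rejected" failed it. A minimal sketch of that pairing step, with illustrative names (the post does not publish its actual pipeline):

```python
# Sketch of building verification-based preference pairs for DPO.
# Structure and names are illustrative, not the experiment's real code.

def build_preference_pairs(samples):
    """Group generations by prompt and pair verified with unverified code.

    `samples` is an iterable of (prompt, code, passed) tuples, where
    `passed` is the verdict of an external verifier (e.g. unit tests).
    Returns DPO-style records: {"prompt", "chosen", "rejected"}.
    """
    by_prompt = {}
    for prompt, code, passed in samples:
        by_prompt.setdefault(prompt, {"pass": [], "fail": []})
        by_prompt[prompt]["pass" if passed else "fail"].append(code)

    pairs = []
    for prompt, buckets in by_prompt.items():
        # One pair per (passing, failing) combination, capped so no
        # single prompt dominates the training set.
        for chosen in buckets["pass"][:2]:
            for rejected in buckets["fail"][:2]:
                pairs.append(
                    {"prompt": prompt, "chosen": chosen, "rejected": rejected}
                )
    return pairs
```

Prompts with only passing or only failing samples contribute nothing, which is one reason curation at small scale is expensive.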
The results
[Chart: DPO training pairs vs. HumanEval accuracy. Y-axis: number of training pairs; bar length: accuracy percentage.]
120 curated pairs: 91.5%, the only configuration that beat baseline. Every increase after that made the model worse; at 2,000 pairs, accuracy fell to 77.4%. Catastrophic collapse.
Why more data made it worse
Three failure modes appeared as we scaled:
1. Distribution shift
Automated pair generation drifted from the target distribution. MBPP-derived pairs did not transfer to HumanEval. Domain overfitting.
2. Preference collapse
The model learned surface patterns — code length, comment density, variable naming — instead of actual correctness signals.
3. Verification overwriting
DPO training degraded the model's core coding ability. It got better at looking verified while getting worse at being correct.
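The second failure mode has a cheap diagnostic: check whether a trivial surface feature like code length already separates chosen from rejected examples. If it does, DPO can "win" by learning that feature instead of correctness. A minimal sketch (hypothetical helper, not the experiment's tooling):

```python
# Diagnostic sketch: measure how predictable the preference label is
# from a surface feature alone. Entirely illustrative.

def surface_feature_gap(pairs, feature=len):
    """Fraction of pairs where the chosen sample scores higher on `feature`.

    Values far from 0.5 mean the label is predictable from the surface
    feature alone, a warning sign for preference collapse.
    """
    if not pairs:
        return 0.5
    wins = sum(
        1 for p in pairs if feature(p["chosen"]) > feature(p["rejected"])
    )
    return wins / len(pairs)
```

Running this with length, comment density, or identifier-name features before training would flag datasets where the correctness signal is confounded with style.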
What this proves
You cannot push Layer 2 into Layer 3. Verification cannot be absorbed into the model. 120 curated pairs is the ceiling, not the starting point.
This is consistent with Karpowicz (2024): self-verification of generative outputs is equivalent to the halting problem. The math says you need an external verifier. The experiment confirms it.
Implication:
Layer 2 is permanent. Not a stepping stone. Not a temporary workaround. Permanent infrastructure for every AI system that generates output.
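In its simplest form, that permanent layer is a check the model cannot absorb: the model proposes, an independent verifier disposes. A minimal sketch, assuming the verifier is a set of unit tests run against the candidate code (function names are illustrative):

```python
# Minimal external-verification loop. Illustrative sketch only.

def verify(code, tests):
    """Execute `code` in a fresh namespace, then run `tests` (a callable
    taking that namespace). Returns True only if the code runs and every
    assertion passes. A real system would sandbox the exec call."""
    namespace = {}
    try:
        exec(code, namespace)  # untrusted input: sandbox in production
        tests(namespace)
        return True
    except Exception:
        return False

def generate_verified(generate, tests, max_attempts=5):
    """Sample candidates until one passes external verification."""
    for _ in range(max_attempts):
        candidate = generate()
        if verify(candidate, tests):
            return candidate
    return None  # verification never lies, but it can come up empty
```

The point of the experiment above is that this loop cannot be trained away: the `verify` step has to stay outside the model.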
Drop a repo link. I'll run it for free.
— Ty