The Proof
Benchmarks, audits, and verification results -- all independently reproducible.
New Case Study
LUCID vs OpenClaw -- 5,000 files, 235 claims, 15.8% compliance
A full pipeline run against a real open-source AI assistant: 43 failures, 15 partial gaps, and 1 critical security issue.
100% pass @ k=5 on HumanEval (164/164)
164 coding tasks. Four verification methods compared across k=1, k=3, and k=5 iterations.
| Method | Pass @ k=1 | Pass @ k=3 | Pass @ k=5 |
|---|---|---|---|
| Baseline | 86.6% | -- | -- |
| Self-Refine | 87.2% | 87.2% | 87.8% |
| LLM-as-Judge | 98.2% | 99.4% | 97.2% |
| LUCID | 98.8% | 100% | 100% |
LLM-as-Judge drops to 97.2% at k=5 as false positives accumulate; LUCID converges monotonically to 100%.
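As a rough illustration of the protocol behind these numbers, here is a minimal sketch of a k-iteration verify-and-refine loop. This is not LUCID's implementation: the `Task` fields and the `generate`, `verify`, and `refine` callables are hypothetical placeholders for caller-supplied model calls.

```python
# Minimal sketch of a k-iteration verify-and-refine evaluation loop.
# Assumptions: `generate`, `verify`, and `refine` are caller-supplied model
# calls; `verify` returns (ok, feedback). None of these names come from LUCID.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str
    passes_tests: Callable[[str], bool]  # runs the benchmark's hidden unit tests

def solve(task: Task, generate, verify, refine, k: int) -> str:
    candidate = generate(task.prompt)
    for _ in range(k):
        ok, feedback = verify(task.prompt, candidate)
        if ok:                            # verifier accepts: stop iterating
            break
        candidate = refine(task.prompt, candidate, feedback)
    return candidate

def pass_rate(tasks: list[Task], generate, verify, refine, k: int) -> float:
    passed = sum(t.passes_tests(solve(t, generate, verify, refine, k)) for t in tasks)
    return passed / len(tasks)            # e.g. 164/164 -> 1.0 on HumanEval
```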
300 real software engineering tasks
SWE-bench Lite: real GitHub issues from production repositories.
| Configuration | Issues resolved | Gain vs. baseline |
|---|---|---|
| Baseline (k=1) | 18.3% | -- |
| LUCID (k=1) | 25% | +36.6% |
| LUCID (best) | 30.3% | +65.6% |
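The gains in the last column are relative to the 18.3% baseline; a quick check (plain arithmetic, not project code):

```python
# Relative improvement over the k=1 baseline (values from the table above).
baseline, lucid_k1, lucid_best = 0.183, 0.250, 0.303

rel_gain = lambda x: (x - baseline) / baseline
print(f"LUCID k=1:  +{rel_gain(lucid_k1):.1%}")   # +36.6%
print(f"LUCID best: +{rel_gain(lucid_best):.1%}")  # +65.6%
```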
Won 7 of 10 head-to-head tasks
Real-world coding challenges scored by expert judges on correctness, security, and edge-case handling.
| Configuration | Judge score (out of 30) |
|---|---|
| Baseline | 21.6 |
| Forward LUCID | 27.2 |
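One plausible reading of the 30-point scale (an assumption, not stated in the results) is that the three criteria are each scored 0-10 and summed; the per-criterion numbers below are purely illustrative.

```python
# Hypothetical rubric aggregation: three criteria, each scored 0-10 by the
# expert judges, summed to the /30 totals reported above. The per-criterion
# values here are illustrative, not from the study.
def total_score(correctness: float, security: float, edge_cases: float) -> float:
    assert all(0 <= s <= 10 for s in (correctness, security, edge_cases))
    return correctness + security + edge_cases

print(total_score(9.5, 8.9, 8.8))  # 27.2 -- the Forward LUCID score reported above
```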