v1.8.5 (2026-06-23)
QA pass on the 2026-06-22 batch: 40 refs re-inspected vs live + benchmarked against the `designlang` deterministic extractor. 35/40 PASS, mean accuracy 98.3/100, 5 MINOR token-value fixes applied.
- Two multi-agent workflows: per-ref live re-inspection + adversarial verify (QA), and our-tokens-vs-designlang-vs-reality comparison. Zero major errors, zero fabrications; brand primaries/fonts all held. `designlang` independently corroborated ~5.4/6 of our key tokens per ref.
- 5 confirmed MINOR corrections (structural token values, not brand errors): paypal card shadow `0.1/4px/12px` → `0.08/24px/48px` (live layered-card); chunghwa radius scale +20px (dominant modal/card default, ~46%); hyundai chatbot FAB `box-shadow` `rgba(0,0,0,0.15) 0 0 20px` added (CREATE sampled the wrapper, not the button); snapchat §6 "none across nav" softened: components are flat, but the marketing nav wrapper carries one shadow; octopusenergy postcode input split into hero (white/Arial-600) vs tariffs (transparent/Chromatophore) surfaces. All 5 verify-reference green; 553 tests pass.
- Benchmark write-up: `data/reference-audits/2026-06-23-qa-designlang-benchmark.md`. Finding: our LLM pipeline is 39–0 (1 tie) more brand-accurate than `designlang` (which picks most-frequent color as "primary", ~35% brand-correct); recommendation = fold `designlang` in as a free deterministic Stage 0 (palette histograms + DTCG) feeding our brand-judgment pass.
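
To illustrate the benchmark finding, here is a minimal sketch of what a frequency-based Stage 0 might hand to the brand-judgment pass. It is not `designlang`'s actual implementation or API: the function names `palette_histogram` and `most_frequent_primary` and the sampled colors are hypothetical, chosen only to show why "most frequent" and "brand primary" can diverge.

```python
from collections import Counter

def palette_histogram(colors):
    """Return (color, share) pairs sorted by descending frequency.

    `colors` is a flat list of hex strings sampled from a page
    (hypothetical input; a real extractor would collect these itself).
    """
    counts = Counter(c.lower() for c in colors)
    total = sum(counts.values())
    return [(color, n / total) for color, n in counts.most_common()]

def most_frequent_primary(colors):
    """Naive heuristic: treat the most frequent color as the 'primary'."""
    histogram = palette_histogram(colors)
    return histogram[0][0] if histogram else None

# Hypothetical sample: backgrounds and body text dominate the count,
# while the accent color used on buttons appears only once.
sampled = ["#FFFFFF"] * 6 + ["#111111"] * 3 + ["#0070E0"] * 1

histogram = palette_histogram(sampled)
naive_primary = most_frequent_primary(sampled)

# The histogram itself is useful evidence; the naive pick is not a brand call.
for color, share in histogram:
    print(f"{color}  {share:.0%}")
print("naive primary:", naive_primary)
```

In this sketch the naive pick lands on the page background, so the full histogram, not the top entry, is what a judgment pass would consume.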

