Preprint 2026 · PRISM-VL

Allegory of the Cave: Measurement-Grounded Vision-Language Learning

Moving the visual interface for VLMs from post-ISP RGB renderings toward RAW-derived measurement evidence.

Kepeng Xu¹ Li Xu¹ Gang He¹ Wenxin Yu²

¹Xidian University ²Southwest University of Science and Technology

Paper PDF Method Results Citation

Rendered RGB can discard evidence before reasoning begins.

Lost measurement-domain signal crop — Lost signal

0.6120 BLEU

0.4571 ROUGE-L

82.66% LLM-Judge

150K Instruction examples

Abstract

Vision-language models are almost universally trained and evaluated on post-ISP RGB images, implicitly treating rendered appearance as a sufficient interface for multimodal grounding. However, RGB rendering is a lossy observation of the underlying sensor measurement: in low-light, high-dynamic-range, and exposure-imbalanced scenes, image signal processing can clip highlights, suppress structures, quantize evidence, and discard task-critical visual signals before reasoning begins.

We formulate measurement-grounded vision-language learning and instantiate it as PRISM-VL, a framework that adapts VLMs to RAW-derived Meas.-XYZ inputs. PRISM-VL combines measurement-domain input, camera-conditioned grounding, and Exposure-Bracketed Supervision Aggregation. On a held-out benchmark, PRISM-VL-8B reaches 0.6120 BLEU, 0.4571 ROUGE-L, and 82.66% LLM-Judge accuracy, improving over the RGB Qwen3-VL-8B baseline by +0.1074 BLEU, +0.1071 ROUGE-L, and +4.46 percentage points.

Method Overview

PRISM-VL separates the model-facing observation, the annotation interface, and the capture context used for grounding.

PRISM-VL measurement-grounded vision-language learning framework — RGB proxies are used for supervision construction, while training and inference operate on RAW-derived Meas.-XYZ observations with camera-conditioned grounding.

Observation

Meas.-XYZ Input

A linear measurement-domain representation keeps the model closer to sensor evidence than post-ISP sRGB.

Conditioning

Camera Context

ISO, exposure time, aperture, and related metadata condition both the question and late visual representations.

Supervision

BracketSup

Exposure-bracketed RGB proxies generate reliable annotation signals that are attached back to the same RAW capture.

Why RGB Can Fail

The lost-signal residual concentrates around illuminated text and other weak evidence, which means the rendered RGB image no longer preserves all cues needed for grounding.

Sample 010 missing signal residual — Residual

Sample 010 luminance distribution — Luminance

Sample 019 missing signal residual — Residual

Sample 019 luminance distribution — Luminance

Results

PRISM-VL improves over RGB-native baselines across BLEU, ROUGE-L, and LLM-Judge accuracy, with the largest practical gains in exposure-sensitive regimes.

Capability radar comparing PRISM-VL against RGB-native VLMs — Capability fingerprint over chromatic, numerosity, HDR, low-illumination, scene-text, spatial, and verification dimensions.

Model	BLEU	ROUGE-L	Judge
Qwen3-VL-2B	0.3407	0.3171	69.54%
Qwen3-VL-4B	0.4442	0.3453	77.37%
Qwen3-VL-8B	0.5046	0.3500	78.20%
PRISM-VL-2B	0.5865	0.4244	77.99%
PRISM-VL-4B	0.6021	0.4465	80.83%
PRISM-VL-8B	0.6120	0.4571	82.66%

Qualitative Comparison

In low-illumination text examples, the RGB baseline grounds on incorrect visible-looking text, while PRISM-VL recovers the reference answer from measurement-domain evidence.

Illuminated shop sign

Question: What is the name of the illuminated shop next to the Beijing Roast Duck?

RGB observation for shop-sign example — RGB observation

Meas.-XYZ observation for shop-sign example — Meas.-XYZ observation

Zoomed measurement-domain crop for shop-sign example — Answer-region crop

RGB Qwen3-VL: "Hua Tian Hua" - incorrect.

PRISM-VL: "Zhengmei Dental Clinic" - correct.

Yellow sign text

Question: What is the word on the first line of the yellow sign?

RGB observation for yellow-sign example — RGB observation

Meas.-XYZ observation for yellow-sign example — Meas.-XYZ observation

Zoomed measurement-domain crop for yellow-sign example — Answer-region crop

RGB Qwen3-VL: "diamond" - incorrect.

PRISM-VL: "BLACK" - correct.

Citation

@article{xu2026allegory,
  title   = {Allegory of the Cave: Measurement-Grounded Vision-Language Learning},
  author  = {Xu, Kepeng and Xu, Li and He, Gang and Yu, Wenxin},
  journal = {arXiv preprint},
  year    = {2026}
}