Evaluation on 10 AbdomenAtlas 3.0 Cases
Merlin generates near-identical “normal” reports for every case, missing major pathology: 10 cm liver masses, 5 cm colon masses, massively enlarged organs.
13 separate generate() calls with independent prompts ("Generate a radiology report for {organ}###\n"). Each call sees 490 image tokens + ~15 prompt tokens independently. No holistic view — the model can’t correlate findings across organ systems. In isolation, each organ is most likely “normal” even in sick patients.
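A minimal sketch of why independent per-organ decoding produces "all normal" reports even when that joint outcome is unlikely. The organ names and probabilities below are illustrative toy numbers, not Merlin's actual output distribution:

```python
# Toy per-organ finding probabilities, assuming each generate() call sees
# only its own organ prompt (numbers are illustrative, not Merlin's).
per_organ = {
    "liver":  {"normal": 0.6, "mass": 0.3, "cyst": 0.1},
    "colon":  {"normal": 0.7, "mass": 0.2, "wall thickening": 0.1},
    "spleen": {"normal": 0.8, "enlarged": 0.2},
}

# Greedy, independent decoding: argmax per organ -> "normal" everywhere.
report = {organ: max(p, key=p.get) for organ, p in per_organ.items()}
print(report)  # every organ reported "normal"

# Yet the joint probability that ALL organs are normal is much lower,
# so "all normal" is the least informative whole-patient report.
joint_all_normal = 1.0
for p in per_organ.values():
    joint_all_normal *= p["normal"]
print(joint_all_normal)  # 0.6 * 0.7 * 0.8 = 0.336
```

Because each call is scored in isolation, no decoding step ever conditions on the other organs' evidence, which is exactly the missing cross-organ correlation described above.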
Greedy decoding (do_sample=False, num_beams=1) always picks the single most probable token. P(“normal” | image, organ_prompt) > P(any specific finding) for most organs, so there is no exploration of lower-probability but correct paths like “lesion”, “mass”, or “enlarged”.
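A toy next-token distribution makes the greedy failure mode concrete (the tokens and probabilities are invented for illustration):

```python
import random

# Illustrative next-token distribution after an organ prompt:
# "normal" is the single most likely token, but findings together
# carry 55% of the mass.
next_token_probs = {"normal": 0.45, "mass": 0.30, "lesion": 0.15, "enlarged": 0.10}

def greedy(probs):
    # do_sample=False, num_beams=1: deterministic argmax.
    return max(probs, key=probs.get)

def sample(probs, rng):
    # Temperature-1 multinomial sampling can reach the finding tokens.
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

print(greedy(next_token_probs))  # always "normal"

rng = random.Random(0)
draws = [sample(next_token_probs, rng) for _ in range(1000)]
print(draws.count("normal") / 1000)  # roughly 0.45; the rest are findings
```

Greedy decoding returns "normal" on every call, while sampled decoding would surface a finding more often than not under this distribution.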
Trained on Stanford radiology reports where ~70–80% of per-organ findings are “normal/unremarkable.” The LoRA adapter (r=512 — extremely large rank) memorized this distribution. Not fixable at inference time without retraining.
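For scale, here is a sketch of how many parameters LoRA adds per adapted weight matrix at r=512 versus a typical rank like r=8. The hidden size of 4096 is an assumed illustrative dimension, not Merlin's confirmed configuration:

```python
# LoRA adds, per adapted weight matrix W (d_out x d_in), two low-rank
# factors A (r x d_in) and B (d_out x r): r * (d_in + d_out) parameters.
def lora_params(d_in, d_out, r):
    return r * (d_in + d_out)

d = 4096  # assumed hidden size, for illustration only
print(lora_params(d, d, r=8))    #    65,536 per matrix at a typical rank
print(lora_params(d, d, r=512))  # 4,194,304 per matrix at r=512 (64x more)
```

At this rank the adapter has enough capacity to memorize the "mostly normal" label distribution of the training reports rather than stay a low-rank correction, which is why the bias cannot be undone at inference time.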
Simple concatenation [490 image tokens, N text tokens] → self-attention. No explicit cross-attention module between vision and language. The language model’s strong “normal” prior overwhelms subtle visual signals.
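A single-query attention toy illustrates how a subtle finding dilutes under plain concatenation. The split of 485 background tokens, 5 lesion tokens, and 15 text tokens, and all logit values, are illustrative assumptions:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

# Toy attention logits over the concatenated sequence:
# 485 background image tokens, 5 tokens covering a subtle lesion,
# 15 text tokens boosted by the language prior (all values illustrative).
logits = [0.0] * 485 + [1.0] * 5 + [2.0] * 15
weights = softmax(logits)

lesion_mass = sum(weights[485:490])
text_mass = sum(weights[490:])
print(round(lesion_mass, 3))  # the lesion gets only ~2% of attention mass
print(round(text_mass, 3))
```

With no dedicated cross-attention module to re-weight the visual evidence, a handful of lesion tokens competes with 500 other positions and loses to the text prior.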
Ground truth reports from AbdomenAtlas 3.0 are algorithmically generated with volumetric measurements and HU values. Merlin was trained on human-written Stanford reports (qualitative, no HU values). Even a perfect model would score low on BLEU/ROUGE due to this style gap.
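A rough bigram-precision check (a crude stand-in for BLEU) shows how two reports describing the same finding in different styles can score zero. Both example sentences are invented for illustration:

```python
# Toy bigram precision between an algorithmic, measurement-heavy report
# and a qualitative human-style report of the same finding.
def bigrams(text):
    toks = text.lower().replace(",", "").split()
    return set(zip(toks, toks[1:]))

reference = "liver mass measuring 10.2 cm, attenuation 35 HU, volume 412 cc"
candidate = "large hypodense mass in the liver"

ref, cand = bigrams(reference), bigrams(candidate)
overlap = len(ref & cand) / len(cand)
print(overlap)  # 0.0 -- no shared bigrams despite the same underlying finding
```

N-gram metrics reward surface form, so the volumetric/HU style of the AbdomenAtlas ground truth penalizes Merlin's qualitative style regardless of clinical correctness.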
5-year survival predictions vary across cases (CVD: 0.54–0.63), confirming the image features carry diagnostic signal. Some findings are detected: “surgically absent with bilateral salpingo-oophorectomy,” degenerative spine changes, occasional cysts. The failure is in the text decoder’s expression of pathology, not in the image encoder.