Evaluation on 10 AbdomenAtlas 3.0 Cases
Merlin generates near-identical “normal” reports for every case, missing major pathology: 10 cm liver masses, 5 cm colon masses, massively enlarged organs.
13 separate generate() calls with independent prompts ("Generate a radiology report for {organ}###\n"). Each call sees 490 image tokens + ~15 prompt tokens independently. No holistic view — the model can’t correlate findings across organ systems. In isolation, each organ is most likely “normal” even in sick patients.
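A minimal sketch of why independent per-organ decoding produces "all normal" reports even when that joint outcome is unlikely. The organ names and probabilities below are illustrative toy numbers, not Merlin's actual output distribution:

```python
# Toy per-organ finding probabilities, assuming each generate() call sees
# only its own organ prompt (numbers are illustrative, not Merlin's).
per_organ = {
    "liver":  {"normal": 0.6, "mass": 0.3, "cyst": 0.1},
    "colon":  {"normal": 0.7, "mass": 0.2, "wall thickening": 0.1},
    "spleen": {"normal": 0.8, "enlarged": 0.2},
}

# Greedy, independent decoding: argmax per organ -> "normal" everywhere.
report = {organ: max(p, key=p.get) for organ, p in per_organ.items()}
print(report)  # every organ reported "normal"

# Yet the joint probability that ALL organs are normal is much lower,
# so "all normal" is the least informative whole-patient report.
joint_all_normal = 1.0
for p in per_organ.values():
    joint_all_normal *= p["normal"]
print(joint_all_normal)  # 0.6 * 0.7 * 0.8 = 0.336
```

Because each call is scored in isolation, no decoding step ever conditions on the other organs' evidence, which is exactly the missing cross-organ correlation described above.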
Greedy decoding (do_sample=False, num_beams=1) always picks the single most probable token. P(“normal” | image, organ_prompt) > P(any specific finding) for most organs, so there is no exploration of lower-probability but correct paths like “lesion”, “mass”, or “enlarged”.
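A toy next-token distribution makes the greedy failure mode concrete (the tokens and probabilities are invented for illustration):

```python
import random

# Illustrative next-token distribution after an organ prompt:
# "normal" is the single most likely token, but findings together
# carry 55% of the mass.
next_token_probs = {"normal": 0.45, "mass": 0.30, "lesion": 0.15, "enlarged": 0.10}

def greedy(probs):
    # do_sample=False, num_beams=1: deterministic argmax.
    return max(probs, key=probs.get)

def sample(probs, rng):
    # Temperature-1 multinomial sampling can reach the finding tokens.
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

print(greedy(next_token_probs))  # always "normal"

rng = random.Random(0)
draws = [sample(next_token_probs, rng) for _ in range(1000)]
print(draws.count("normal") / 1000)  # roughly 0.45; the rest are findings
```

Greedy decoding returns "normal" on every call, while sampled decoding would surface a finding more often than not under this distribution.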
Trained on Stanford radiology reports where ~70–80% of per-organ findings are “normal/unremarkable.” The LoRA adapter (r=512 — extremely large rank) memorized this distribution. Not fixable at inference time without retraining.
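For scale, here is a sketch of how many parameters LoRA adds per adapted weight matrix at r=512 versus a typical rank like r=8. The hidden size of 4096 is an assumed illustrative dimension, not Merlin's confirmed configuration:

```python
# LoRA adds, per adapted weight matrix W (d_out x d_in), two low-rank
# factors A (r x d_in) and B (d_out x r): r * (d_in + d_out) parameters.
def lora_params(d_in, d_out, r):
    return r * (d_in + d_out)

d = 4096  # assumed hidden size, for illustration only
print(lora_params(d, d, r=8))    #    65,536 per matrix at a typical rank
print(lora_params(d, d, r=512))  # 4,194,304 per matrix at r=512 (64x more)
```

At this rank the adapter has enough capacity to memorize the "mostly normal" label distribution of the training reports rather than stay a low-rank correction, which is why the bias cannot be undone at inference time.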
Simple concatenation [490 image tokens, N text tokens] → self-attention. No explicit cross-attention module between vision and language. The language model’s strong “normal” prior overwhelms subtle visual signals.
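A single-query attention toy illustrates how a subtle finding dilutes under plain concatenation. The split of 485 background tokens, 5 lesion tokens, and 15 text tokens, and all logit values, are illustrative assumptions:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

# Toy attention logits over the concatenated sequence:
# 485 background image tokens, 5 tokens covering a subtle lesion,
# 15 text tokens boosted by the language prior (all values illustrative).
logits = [0.0] * 485 + [1.0] * 5 + [2.0] * 15
weights = softmax(logits)

lesion_mass = sum(weights[485:490])
text_mass = sum(weights[490:])
print(round(lesion_mass, 3))  # the lesion gets only ~2% of attention mass
print(round(text_mass, 3))
```

With no dedicated cross-attention module to re-weight the visual evidence, a handful of lesion tokens competes with 500 other positions and loses to the text prior.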
Ground truth reports from AbdomenAtlas 3.0 are algorithmically generated with volumetric measurements and HU values. Merlin was trained on human-written Stanford reports (qualitative, no HU values). Even a perfect model would score low on BLEU/ROUGE due to this style gap.
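A rough bigram-precision check (a crude stand-in for BLEU) shows how two reports describing the same finding in different styles can score zero. Both example sentences are invented for illustration:

```python
# Toy bigram precision between an algorithmic, measurement-heavy report
# and a qualitative human-style report of the same finding.
def bigrams(text):
    toks = text.lower().replace(",", "").split()
    return set(zip(toks, toks[1:]))

reference = "liver mass measuring 10.2 cm, attenuation 35 HU, volume 412 cc"
candidate = "large hypodense mass in the liver"

ref, cand = bigrams(reference), bigrams(candidate)
overlap = len(ref & cand) / len(cand)
print(overlap)  # 0.0 -- no shared bigrams despite the same underlying finding
```

N-gram metrics reward surface form, so the volumetric/HU style of the AbdomenAtlas ground truth penalizes Merlin's qualitative style regardless of clinical correctness.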
5-year survival predictions vary across cases (CVD: 0.54–0.63), confirming the image features carry diagnostic signal. Some findings are detected: “surgically absent with bilateral salpingo-oophorectomy,” degenerative spine changes, occasional cysts. The failure is in the text decoder’s expression of pathology, not in the image encoder.