Can These Views Be One Scene? Evaluating Multiview 3D Consistency when 3D Foundation Models Hallucinate

Soumava Paul^*, Prakhar Kaushik^*, Alan Yuille

CCVL, Johns Hopkins University

^*Equal contribution

arXiv Code (May 20 EOD) SysCON3D Benchmark Human Evaluation

Teaser figure: unrelated scenes, repeated views, and Gaussian noise are scored as highly consistent by MEt3R and yield dense hallucinated geometry from VGGT, MASt3R, DUSt3R, and Fast3R.

Unrelated scenes, repeated views, and Gaussian noise should be scored as 3D-inconsistent — they do not define a single static scene. Yet learned backbones (VGGT, MASt3R, DUSt3R, Fast3R) still produce dense geometry on these inputs, and metrics built on top (such as MEt3R) inherit the hallucination. We diagnose this with SysCON3D, a backbone–residual–aggregation family, and robust, human-aligned COLMAP-based metrics.

Abstract

Multiview 3D evaluation assumes that the images being scored are observations of one static 3D scene. This assumption can fail in NVS and sparse-view reconstruction: inputs or generated outputs may contain artifacts, outlier frames, repeated views, or noise, yet still receive high 3D consistency scores. Existing reference-based metrics require ground truth, while ground-truth-free metrics such as MEt3R depend on learned reconstruction backbones whose failure modes are poorly characterized. We study this reliability problem by comparing neural reconstruction priors with classical geometric verification. We introduce SysCON3D, a controlled robustness benchmark for multiview 3D consistency, and a parametric family that decomposes neural metrics into backbone, residual, and aggregation components. This family recovers MEt3R and yields variants up to 3× more robust. Our analysis shows that VGGT, MASt3R, DUSt3R, and Fast3R can hallucinate dense geometry and cross-view support for unrelated scenes, repeated images, and random noise. We introduce COLMAP-based metrics that use matches, registration, dense support, and reconstruction failure as failure-aware consistency signals. On real NVS outputs and a structured human study, these metrics achieve up to 4× higher correlation with human judgments than MEt3R.

Key takeaways

Unified framework

We cast existing and new 3D-consistency metrics as a composition of backbone × residual × aggregation, and study distributional aggregations (e.g. MMD against a zero-error reference) that preserve information mean-based scores discard.

SysCON3D stress test

A controlled benchmark that injects 3D inconsistencies -- single outliers, patched regions, cross-scene mixtures, and Gaussian noise -- into multi-view sets, and reveals where current metrics rank inconsistent inputs above consistent ones.

COLMAP holds up

Despite its age, COLMAP-based metrics are robust to noise/outliers and align up to 4× better with human ranking than MEt3R, suggesting that learned 3D backbones are not yet drop-in replacements for evaluation.

How our metric compares

Metric	Human-aligned	Robust	Scene-level	Interpretable
MEt3R	✗	✗	✓	✗
PRISM	✓	✓	✗	✗
Ours (MASt3R-W-IMQ)	✓	✓	✓	✗
Ours (COLMAP-based)	✓	✓	✓	✓

3D backbones don't abstain on inconsistent inputs

We feed data-driven 3D reconstruction backbones inputs that have no shared 3D structure -- identical images, images from different scenes, and pure Gaussian noise -- and observe that they refuse to abstain from generating 3D geometry, returning dense, high-confidence point clouds anyway. Any consistency metric built on top inherits this behavior.

Data-driven backbones such as VGGT/MASt3R still produce dense geometry on noise / cross-scene / identical-image inputs.

Interactive SysCON3D reconstruction gallery

Choose a deterministic SysCON3D stress-test sample and compare the point clouds produced by MASt3R, DUSt3R, Fast3R, VGGT, and RobustVGGT. These are prerecorded GLB reconstructions from the benchmark, so the project page stays lightweight and does not run live GPU inference.

SysCON3D sample

All views are deterministic Gaussian noise images with no 3D scene.

Drag any panel to orbit the 3D point cloud. Use scroll or pinch to zoom.

MASt3R Gaussian random noise, K=6

DUSt3R Gaussian random noise, K=6

Fast3R Gaussian random noise, K=6

VGGT Gaussian random noise, K=6

RobustVGGT Gaussian random noise, K=6

A model that knew when to abstain would return an empty or low-confidence cloud on inputs with no coherent scene. The learned backbones often return structured geometry anyway.

Why do they not abstain? Four diagnostics

We probe each backbone with controlled cross-scene mixtures and Gaussian-noise inputs and track where the failure happens -- before features are even compared.

−63 pp

Zero-overlap rate underestimation

On the hardest cross-scene mixture (L₃, where 89% of pairs are cross-scene), DUSt3R reports zero-overlap on only 26% of pairs. MASt3R, FASt3R, and VGGT still under-detect by 21, 48, and 18 pp respectively. The backbone declares overlap where geometrically there should be none.

100% on noise

Cross-scene FNR -- hallucinated correspondences

Conditioning only on truly cross-scene pairs, DUSt3R falsely declares overlap on 64-77% across mixture levels; MASt3R is the most conservative at 18-26%. On Gaussian noise, MASt3R, DUSt3R, and VGGT hallucinate overlap on every pair; FASt3R on 22.7% of pairs.

~0.4 gap, large σ

DINOv2 residuals don't separate

Once the backbone declares two views overlap, cross-scene DINOv2 cosine residuals are only ~0.42-0.46 above same-scene residuals (MASt3R, DUSt3R), with standard deviations large enough to overlap. DINOv2 captures semantic similarity, not per-instance 3D identity, so spurious correspondences score nearly as low as real ones.

0.80 ghost mass (VGGT, noise)

Ghost mass -- rendered pixels from the wrong scene

For the global backbones (FASt3R, VGGT) we measure the fraction of rendered pixels supported by points from a foreign scene. On L₃ and full mixture roughly 40-50% of pixels come from wrong-scene points; on Gaussian noise VGGT's ghost mass climbs to 0.80 ± 0.10 -- the reconstruction is mostly invented.

The interactive gallery above shows representative precomputed reconstructions for these stress cases across K ∈ {3, 6}.

SysCON3D — controlled inconsistency stress test

SysCON3D injects four families of 3D inconsistency into clean multi-view sets so we can measure how a metric ranks known inconsistent samples against clean ones. The deterministic benchmark manifests and image payload are available on HuggingFace; the interactive gallery above shows representative precomputed reconstructions.

One-outlier perturbation example — One-outlier -- replace one frame with an unrelated image.

Patched perturbation example — Patched -- paste a foreign region into a subset of frames.

Cross-scene mixture example — Cross-scene mixture -- interleave frames from different scenes.

Gaussian noise perturbation example — Gaussian noise -- replace pixel content with random noise.

Human evaluation interface for ranking NVS outputs.

Human evaluation study

We ground all metric claims in a structured human study where evaluators view two NVS-rendered video clips side-by-side and rank them along three independent axes: 3D consistency, realism, and plausibility w.r.t. the input photos.

Each pair supports variable-speed playback and an optional free-text "why" prompt per axis, so judgments capture which dimension drove the preference rather than collapsing to a single scalar.

Sessions track a configurable target number of votes per evaluator, enabling multi-rater agreement analysis on top of the raw rankings.

Open the human evaluation site

BibTeX

@misc{paul2026viewssceneevaluatingmultiview,
      title={Can These Views Be One Scene? Evaluating Multiview 3D Consistency when 3D Foundation Models Hallucinate}, 
      author={Soumava Paul and Prakhar Kaushik and Alan Yuille},
      year={2026},
      eprint={2605.18754},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.18754}, 
}