Complete Gaussian Splats from a Single Image with Denoising Diffusion Models

¹University of Toronto, ²Niantic Spatial

We predict full Gaussian scenes from a single RGB input image. Our diffusion-based model outputs sharper results than existing methods, and is also able to sample diverse completion "modes" given a single image as input.

Abstract

Gaussian splatting typically requires dense observations of the scene and can fail to reconstruct occluded and unobserved areas. We propose a latent diffusion model to reconstruct a complete 3D scene with Gaussian splats, including the occluded parts, from only a single image during inference.

Completing the unobserved surfaces of a scene is challenging because many plausible surfaces are consistent with a single observation. Conventional methods use a regression-based formulation that predicts a single mode for occluded and out-of-frustum surfaces, leading to blurriness, implausibility, and a failure to capture the multiple possible explanations. As a result, they typically address the problem only partially: they focus on objects isolated from the background, reconstruct only the visible surfaces, or fail to extrapolate far from the input views.

In contrast, we propose a generative formulation to learn a distribution of 3D representations of Gaussian splats conditioned on a single input image. To address the lack of ground-truth training data, we propose a Variational AutoReconstructor to learn a latent space only from 2D images in a self-supervised manner, over which a diffusion model is trained. Our method generates faithful reconstructions and diverse samples with the ability to complete the occluded surfaces for high-quality 360° renderings.

Related Work


Comparison to Related Work. The table highlights the main differences from closely related baselines. We propose a generative method with diffusion models to reconstruct 3D scenes with Gaussian splats in real time from a single image.

Method


Learning a latent space for 3D representations using only images, without ground-truth 3D data. (a) Variational Autoencoders require ground-truth samples of high-dimensional variables \( x \) to learn a latent space; (b) We propose the Variational AutoReconstructor, which learns a latent space for \( x \) using supervision from only their projections \( \{m = f(x)\} \).
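As a schematic illustration (our notation, with loss weights and exact loss terms left to the paper), the AutoReconstructor is trained by rendering the decoded representation back into the available views rather than comparing against \( x \) directly:

\[
\mathcal{L}_{\mathrm{VAR}} \;=\; \sum_{v} \big\| f_v\big(g_\theta(z)\big) - m_v \big\|_1 \;+\; \lambda\, D_{\mathrm{KL}}\!\big( q_\phi(z \mid m_{\mathrm{in}}) \,\big\|\, \mathcal{N}(0, I) \big), \qquad z \sim q_\phi(z \mid m_{\mathrm{in}}),
\]

where \( q_\phi \) is the encoder, \( g_\theta \) the reconstructor, \( f_v \) the differentiable projection into view \( v \) (here, Gaussian-splat rendering), and \( m_v \) the corresponding ground-truth image.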


Learning a latent space for Splatter Images. Our encoder predicts the parameters of a normal distribution over latents. We reconstruct a sampled latent into \( H \times W \times MN \) Splatter Image representations. We render the Gaussian splats from the viewpoints of the target training images and optimize reprojection losses between the rendered and ground-truth RGB images. Skip connections are critical to preserving the high-frequency details of the predictions, as shown in the Figure below.
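The following is a minimal PyTorch sketch of one AutoReconstructor training step under the scheme described above. The names `encoder`, `reconstructor`, and `render_gaussians` are placeholders rather than the authors' implementation; any differentiable Gaussian-splat renderer could stand in for `render_gaussians`, and the specific reprojection losses and weights follow the paper.

```python
# Illustrative sketch of a Variational AutoReconstructor training step.
# All module names are hypothetical stand-ins for the paper's components.
import torch
import torch.nn.functional as F

def var_training_step(encoder, reconstructor, render_gaussians,
                      input_image, target_images, target_cameras, kl_weight=1e-4):
    """encoder:          image -> (mu, logvar, skips)  latent distribution + skip features
       reconstructor:    (z, skips) -> splatter image  H x W x (M*N) Gaussian parameters
       render_gaussians: (splatter, camera) -> RGB     differentiable splat rendering
    """
    mu, logvar, skips = encoder(input_image)

    # Reparameterization trick: sample a latent while keeping gradients.
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

    # Decode the latent (plus skip connections) into a Splatter Image.
    splatter = reconstructor(z, skips)

    # Reprojection loss: render the predicted Gaussians into every target view
    # and compare against ground-truth RGB images. No 3D supervision is used.
    photo_loss = 0.0
    for cam, gt in zip(target_cameras, target_images):
        rendered = render_gaussians(splatter, cam)
        photo_loss = photo_loss + F.l1_loss(rendered, gt)

    # Standard KL regularizer towards a unit Gaussian prior over latents.
    kl = -0.5 * torch.mean(1.0 + logvar - mu.pow(2) - logvar.exp())

    return photo_loss + kl_weight * kl
```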


Training a denoising diffusion model over the learned latent space. Conditioned on a single input image, the network learns to convert a corrupted, noisy latent code back into the ground-truth latent code representing a Splatter Image.
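A minimal sketch of this training objective is shown below (PyTorch). The `denoiser` and `image_encoder` modules, the noise schedule `alphas_cumprod`, and the conditioning-dropout probability are illustrative assumptions, not the exact architecture or hyperparameters from the paper.

```python
# Illustrative latent-diffusion training step with epsilon-prediction loss.
import torch
import torch.nn.functional as F

def diffusion_training_step(denoiser, image_encoder, z0, input_image,
                            alphas_cumprod, uncond_prob=0.1):
    """z0: clean latent from the frozen AutoReconstructor encoder, shape (B, C, h, w)."""
    B, device = z0.shape[0], z0.device

    # Sample a random diffusion timestep and corrupt the clean latent.
    t = torch.randint(0, len(alphas_cumprod), (B,), device=device)
    a_bar = alphas_cumprod[t].view(B, 1, 1, 1)
    noise = torch.randn_like(z0)
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise

    # Condition on features of the single input image; randomly drop the
    # conditioning so classifier-free guidance is possible at inference time.
    cond = image_encoder(input_image)
    drop = (torch.rand(B, device=device) < uncond_prob).view(B, 1, 1, 1)
    cond = torch.where(drop, torch.zeros_like(cond), cond)

    # Standard epsilon-prediction loss: recover the injected noise.
    eps_pred = denoiser(torch.cat([z_t, cond], dim=1), t)
    return F.mse_loss(eps_pred, noise)
```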


Diffusion Inference Pipeline. We first compute input image features using the Stable Diffusion encoder. A random latent code and the input image features are concatenated and passed through \( R \) steps of the denoising diffusion process. The denoised latent code, together with skip connections from the encoder, is then passed through the reconstructor to produce a Splatter Image representation. This representation is subsequently backprojected to create 3D Gaussian splats.
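The inference pipeline can be sketched as follows (PyTorch). The names `denoiser`, `image_encoder`, `reconstructor`, and `backproject_to_gaussians` are placeholders for the paper's components, and the deterministic DDIM-style update is a standard sampler used here for illustration, not necessarily the exact one in the paper.

```python
# Illustrative single-image inference: noise -> denoised latent -> Splatter Image -> 3D Gaussians.
import torch

@torch.no_grad()
def infer_splats(denoiser, image_encoder, reconstructor, backproject_to_gaussians,
                 input_image, alphas_cumprod, latent_shape, num_steps=50):
    # 1. Encode the single input image into conditioning features and skip connections.
    #    (Assumes the features share the latent's spatial resolution for concatenation.)
    cond, skips = image_encoder(input_image)

    # 2. Start from a random latent and run R denoising steps.
    z = torch.randn(latent_shape, device=input_image.device)
    timesteps = torch.linspace(len(alphas_cumprod) - 1, 0, num_steps).long()
    for i, t in enumerate(timesteps):
        a_bar = alphas_cumprod[t]
        eps = denoiser(torch.cat([z, cond], dim=1), t.expand(z.shape[0]).to(z.device))
        # Estimate the clean latent, then step to the previous noise level (DDIM, eta = 0).
        z0_hat = (z - (1.0 - a_bar).sqrt() * eps) / a_bar.sqrt()
        a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < num_steps else torch.tensor(1.0)
        z = a_prev.sqrt() * z0_hat + (1.0 - a_prev).sqrt() * eps

    # 3. Decode the denoised latent (plus encoder skip connections) into a
    #    Splatter Image, then backproject its pixels into 3D Gaussian splats.
    splatter = reconstructor(z, skips)
    return backproject_to_gaussians(splatter)
```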


Including skip connections helps preserve high-frequency details from the input view in the AutoReconstructor, improving the faithfulness of appearance.

Experiments


Qualitative results on the "Hydrants" category from the CO3D dataset. We produce significantly sharper results than PixelNeRF, and comparable or better performance on object areas compared to DFM, while being significantly faster.


Qualitative results on the "TeddyBears" category from the CO3D dataset. Our model produces sharper results with higher-quality details compared to the baselines LGM and SplatterImage, especially in occluded areas.


Qualitative Results on the RealEstate10K Dataset. Our method achieves comparable performance to DFM, a diffusion-based NeRF model, and performs better in some challenging regions (highlighted with dotted boxes), while being significantly faster at inference time.


3D Generative Performance. Our diffusion model demonstrates the ability to (1) sample diverse outputs in ambiguous situations, and (2) fill in missing areas in a multi-view-consistent way using 3D priors learned from large datasets. Note that the model is trained purely from 2D images.


Diverse samples from our diffusion model on Hydrants in the CO3D dataset. We intentionally show three samples with increasing diversity from left to right by controlling the classifier-free guidance and skip connection weights. The samples transition from faithful reconstruction of the input image to diverse generations exhibiting variations in texture, shape, and style. In contrast, the baseline DFM exhibits only small texture variations on object areas.
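For reference, the guidance used to trade off fidelity against diversity follows the standard classifier-free guidance formulation, where \( c \) denotes the input-image conditioning, \( \varnothing \) the dropped conditioning, and \( s \) the guidance scale (how \( s \) interacts with the skip-connection weights is as described in the paper):

\[
\hat{\epsilon}_\theta(z_t, c) \;=\; \epsilon_\theta(z_t, \varnothing) \;+\; s \,\big( \epsilon_\theta(z_t, c) - \epsilon_\theta(z_t, \varnothing) \big).
\]

Larger \( s \) (together with stronger skip connections) pushes samples towards faithful reconstructions of the input view, while smaller \( s \) yields more diverse completions.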

We present more quantitative and qualitative results in our main paper and Supplementary Materials.

BibTeX

@article{liao2025complete,
  author    = {Liao, Ziwei and Sayed, Mohamed and Waslander, Steven L. and Vicente, Sara and Turmukhambetov, Daniyar and Firman, Michael},
  title     = {Complete Gaussian Splats from a Single Image with Denoising Diffusion Models},
  journal   = {arXiv preprint arXiv:2508.21542},
  year      = {2025},
}