Complete Gaussian Splats from a Single Image with Denoising Diffusion Models

¹University of Toronto, ²Niantic Spatial

We predict full Gaussian scenes from a single RGB input image. Our diffusion-based model outputs sharper results than existing methods, and is also able to sample diverse completion "modes" given a single image as input.

Abstract

Gaussian splatting typically requires dense observations of the scene and can fail to reconstruct occluded and unobserved areas. We propose a latent diffusion model to reconstruct a complete 3D scene with Gaussian splats, including the occluded parts, from only a single image during inference.

Completing the unobserved surfaces of a scene is challenging because many plausible surfaces are consistent with a single observation. Conventional methods use a regression-based formulation that predicts a single mode for occluded and out-of-frustum surfaces, leading to blurriness, implausibility, and a failure to capture the multiple possible explanations. As a result, they typically address the problem only partially: they focus on objects isolated from the background, reconstruct only the visible surfaces, or fail to extrapolate far from the input views.

In contrast, we propose a generative formulation to learn a distribution of 3D representations of Gaussian splats conditioned on a single input image. To address the lack of ground-truth training data, we propose a Variational AutoReconstructor to learn a latent space only from 2D images in a self-supervised manner, over which a diffusion model is trained. Our method generates faithful reconstructions and diverse samples with the ability to complete the occluded surfaces for high-quality 360° renderings.

Related Work


Comparison to Related Work. The table highlights the main differences from closely related baselines. We propose a generative method with diffusion models to reconstruct 3D scenes with Gaussian splats in real time from a single image.

Method


Learning a latent space for 3D representations using only images, without ground-truth 3D data. (a) Variational Autoencoders require ground-truth samples of high-dimensional variables \( x \) to learn a latent space; (b) We propose the Variational AutoReconstructor, which learns a latent space for \( x \) using supervision from only their projections \( \{m = f(x)\} \).
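As a schematic illustration (our notation, with loss weights and exact loss terms left to the paper), the AutoReconstructor is trained by rendering the decoded representation back into the available views rather than comparing against \( x \) directly:

\[
\mathcal{L}_{\mathrm{VAR}} \;=\; \sum_{v} \big\| f_v\big(g_\theta(z)\big) - m_v \big\|_1 \;+\; \lambda\, D_{\mathrm{KL}}\!\big( q_\phi(z \mid m_{\mathrm{in}}) \,\big\|\, \mathcal{N}(0, I) \big), \qquad z \sim q_\phi(z \mid m_{\mathrm{in}}),
\]

where \( q_\phi \) is the encoder, \( g_\theta \) the reconstructor, \( f_v \) the differentiable projection into view \( v \) (here, Gaussian-splat rendering), and \( m_v \) the corresponding ground-truth image.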


Learning a latent space for Splatter Images. Our encoder predicts the parameters of a normal distribution over latents. We reconstruct a sampled latent into \( H \times W \times MN \) Splatter Image representations. We render the Gaussian splats from the viewpoints of the target training images and optimize reprojection losses between the rendered and ground-truth RGB images. Skip connections are critical to preserving the high-frequency details of the predictions, as shown in the Figure below.
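The following is a minimal PyTorch sketch of one AutoReconstructor training step under the scheme described above. The names `encoder`, `reconstructor`, and `render_gaussians` are placeholders rather than the authors' implementation; any differentiable Gaussian-splat renderer could stand in for `render_gaussians`, and the specific reprojection losses and weights follow the paper.

```python
# Illustrative sketch of a Variational AutoReconstructor training step.
# All module names are hypothetical stand-ins for the paper's components.
import torch
import torch.nn.functional as F

def var_training_step(encoder, reconstructor, render_gaussians,
                      input_image, target_images, target_cameras, kl_weight=1e-4):
    """encoder:          image -> (mu, logvar, skips)  latent distribution + skip features
       reconstructor:    (z, skips) -> splatter image  H x W x (M*N) Gaussian parameters
       render_gaussians: (splatter, camera) -> RGB     differentiable splat rendering
    """
    mu, logvar, skips = encoder(input_image)

    # Reparameterization trick: sample a latent while keeping gradients.
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

    # Decode the latent (plus skip connections) into a Splatter Image.
    splatter = reconstructor(z, skips)

    # Reprojection loss: render the predicted Gaussians into every target view
    # and compare against ground-truth RGB images. No 3D supervision is used.
    photo_loss = 0.0
    for cam, gt in zip(target_cameras, target_images):
        rendered = render_gaussians(splatter, cam)
        photo_loss = photo_loss + F.l1_loss(rendered, gt)

    # Standard KL regularizer towards a unit Gaussian prior over latents.
    kl = -0.5 * torch.mean(1.0 + logvar - mu.pow(2) - logvar.exp())

    return photo_loss + kl_weight * kl
```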


Training a denoising diffusion model over the learned latent space. Conditioned on a single input image, the network learns to convert a corrupted, noisy latent code back into the ground-truth latent code representing a Splatter Image.
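A minimal sketch of this training objective is shown below (PyTorch). The `denoiser` and `image_encoder` modules, the noise schedule `alphas_cumprod`, and the conditioning-dropout probability are illustrative assumptions, not the exact architecture or hyperparameters from the paper.

```python
# Illustrative latent-diffusion training step with epsilon-prediction loss.
import torch
import torch.nn.functional as F

def diffusion_training_step(denoiser, image_encoder, z0, input_image,
                            alphas_cumprod, uncond_prob=0.1):
    """z0: clean latent from the frozen AutoReconstructor encoder, shape (B, C, h, w)."""
    B, device = z0.shape[0], z0.device

    # Sample a random diffusion timestep and corrupt the clean latent.
    t = torch.randint(0, len(alphas_cumprod), (B,), device=device)
    a_bar = alphas_cumprod[t].view(B, 1, 1, 1)
    noise = torch.randn_like(z0)
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise

    # Condition on features of the single input image; randomly drop the
    # conditioning so classifier-free guidance is possible at inference time.
    cond = image_encoder(input_image)
    drop = (torch.rand(B, device=device) < uncond_prob).view(B, 1, 1, 1)
    cond = torch.where(drop, torch.zeros_like(cond), cond)

    # Standard epsilon-prediction loss: recover the injected noise.
    eps_pred = denoiser(torch.cat([z_t, cond], dim=1), t)
    return F.mse_loss(eps_pred, noise)
```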


Diffusion Inference Pipeline. We first compute input image features using the Stable Diffusion encoder. A random latent code and the input image features are concatenated and passed through \( R \) steps of the denoising diffusion process. The denoised latent code, together with skip connections from the encoder, is then passed through the reconstructor to produce a Splatter Image representation. This representation is subsequently backprojected to create 3D Gaussian splats.
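The inference pipeline can be sketched as follows (PyTorch). The names `denoiser`, `image_encoder`, `reconstructor`, and `backproject_to_gaussians` are placeholders for the paper's components, and the deterministic DDIM-style update is a standard sampler used here for illustration, not necessarily the exact one in the paper.

```python
# Illustrative single-image inference: noise -> denoised latent -> Splatter Image -> 3D Gaussians.
import torch

@torch.no_grad()
def infer_splats(denoiser, image_encoder, reconstructor, backproject_to_gaussians,
                 input_image, alphas_cumprod, latent_shape, num_steps=50):
    # 1. Encode the single input image into conditioning features and skip connections.
    #    (Assumes the features share the latent's spatial resolution for concatenation.)
    cond, skips = image_encoder(input_image)

    # 2. Start from a random latent and run R denoising steps.
    z = torch.randn(latent_shape, device=input_image.device)
    timesteps = torch.linspace(len(alphas_cumprod) - 1, 0, num_steps).long()
    for i, t in enumerate(timesteps):
        a_bar = alphas_cumprod[t]
        eps = denoiser(torch.cat([z, cond], dim=1), t.expand(z.shape[0]).to(z.device))
        # Estimate the clean latent, then step to the previous noise level (DDIM, eta = 0).
        z0_hat = (z - (1.0 - a_bar).sqrt() * eps) / a_bar.sqrt()
        a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < num_steps else torch.tensor(1.0)
        z = a_prev.sqrt() * z0_hat + (1.0 - a_prev).sqrt() * eps

    # 3. Decode the denoised latent (plus encoder skip connections) into a
    #    Splatter Image, then backproject its pixels into 3D Gaussian splats.
    splatter = reconstructor(z, skips)
    return backproject_to_gaussians(splatter)
```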


Including skip connections helps preserve high-frequency details from the input view in the AutoReconstructor, improving the faithfulness of appearance.

Experiments


Qualitative results on the "Hydrants" category from the CO3D dataset. We produce significantly sharper results than PixelNeRF, and comparable or better performance on object areas compared to DFM, while being significantly faster.


Qualitative results on the "TeddyBears" category from the CO3D dataset. Our model produces sharper results with higher-quality details compared to the baselines LGM and SplatterImage, especially in occluded areas.


Qualitative Results on the RealEstate10K Dataset. Our method achieves comparable performance to DFM, a diffusion-based NeRF model, and performs better in some challenging regions (highlighted with dotted boxes), while being significantly faster at inference time.


3D Generative Performance. Our diffusion model demonstrates the ability to (1) sample diverse outputs in ambiguous situations, and (2) fill in missing areas in a multi-view-consistent way using 3D priors learned from large datasets. Note that the model is trained purely from 2D images.


Diverse samples from our diffusion model on Hydrants in the CO3D dataset. We intentionally show three samples with increasing diversity from left to right by controlling the classifier-free guidance and skip connection weights. The samples transition from faithful reconstruction of the input image to diverse generations exhibiting variations in texture, shape, and style. In contrast, the baseline DFM exhibits only small texture variations on object areas.
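For reference, the guidance used to trade off fidelity against diversity follows the standard classifier-free guidance formulation, where \( c \) denotes the input-image conditioning, \( \varnothing \) the dropped conditioning, and \( s \) the guidance scale (how \( s \) interacts with the skip-connection weights is as described in the paper):

\[
\hat{\epsilon}_\theta(z_t, c) \;=\; \epsilon_\theta(z_t, \varnothing) \;+\; s \,\big( \epsilon_\theta(z_t, c) - \epsilon_\theta(z_t, \varnothing) \big).
\]

Larger \( s \) (together with stronger skip connections) pushes samples towards faithful reconstructions of the input view, while smaller \( s \) yields more diverse completions.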

We present more quantitative and qualitative results in our main paper and Supplementary Materials.

BibTeX

@article{liao2025complete,
  author    = {Liao, Ziwei and Sayed, Mohamed and Waslander, Steven L. and Vicente, Sara and Turmukhambetov, Daniyar and Firman, Michael},
  title     = {Complete Gaussian Splats from a Single Image with Denoising Diffusion Models},
  journal   = {arXiv preprint arXiv:2508.21542},
  year      = {2025},
}