Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images

Matias Turkulainen1*     Akshay Krishnan2     Filippo Aleotti3     Mohamed Sayed3
Guillermo Garcia-Hernando3     Juho Kannala1,4     Arno Solin1,5     Gabriel Brostow3,6
Daniyar Turmukhambetov3

1Aalto University     2Georgia Tech     3Niantic Spatial     4University of Oulu     5ELLIS Institute Finland     6UCL

CVPR 2026 — Denver, Colorado, June 3–7, 2026
Cross-View Splatter teaser

TL;DR: A feed-forward 3D Gaussian splatting model that fuses GPS-tagged ground photos with orthorectified satellite imagery for improved outdoor novel-view synthesis.

Abstract


We present Cross-View Splatter, a feed-forward method that predicts pixel-aligned Gaussian splats for outdoor scenes captured at ground level and by satellite. Faithful reconstructions require good camera coverage, but ground imagery is time-consuming and hard to capture at scale for large outdoor scenes. Fortunately, satellite imagery can provide a global geometric prior that is easy to access via public APIs. Cross-View Splatter fuses orthorectified satellite views with GPS-tagged ground photos to predict Gaussian splats in a unified 3D coordinate frame. By aligning ground and bird’s-eye feature representations, our model improves scene coverage and novel-view synthesis compared to ground imagery alone. We train on curated georeferenced datasets and paired satellite–terrain data mined from open mapping services, and evaluate our method on a new benchmark for novel-view synthesis with georeferenced imagery, allowing comparison to prior state-of-the-art methods.

Method


We introduce a new task, view synthesis with georeferenced imagery, where the goal is to synthesize novel views of outdoor scenes from sparse ground-level photos and a single orthorectified satellite image.

Overview of the Cross-View Splatter method

Our Cross-View Splatter model adapts the VGGT architecture and directly predicts 3D Gaussian splats in a unified 3D coordinate frame for both ground and satellite images. Ground and satellite images are encoded into a shared feature space via cross-attention, and separate heads regress Gaussian attributes for each branch. The full model is trained on curated georeferenced datasets and paired satellite–terrain data mined from public mapping and elevation sources, and we evaluate it on a new benchmark for georeferenced novel view synthesis.
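As a rough illustration of the cross-attention step described above, the sketch below lets each ground token attend over a set of satellite tokens using single-head scaled dot-product attention. It is not the model's actual architecture: the function names, the toy two-dimensional tokens, and the absence of learned projections are all assumptions made for brevity, and a real implementation would use a tensor library.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys, values):
    """Single-head scaled dot-product cross-attention (no learned weights).

    queries: list of d-dim vectors, e.g. ground-image tokens (hypothetical)
    keys, values: lists of vectors, e.g. satellite-image tokens (hypothetical)
    Returns one attended vector per query.
    """
    d = len(queries[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Toy example: two ground tokens attend over three satellite tokens.
ground_tokens = [[1.0, 0.0], [0.0, 1.0]]
satellite_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attend(ground_tokens, satellite_tokens, satellite_tokens)
```

Each output row is a convex combination of the satellite tokens, which is the sense in which ground features can pull in bird's-eye context.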

Reconstructions


Example outputs on scenes not seen during training.

Predicted height maps, confidences, and Gaussian splats for held-out scenes

Left to right: input ground images, input satellite image, predicted height map, height confidence (black: low, red: high), predicted ground Gaussians, and combined ground + satellite Gaussians.
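To make the role of the predicted height map concrete, here is a minimal sketch of how per-pixel heights on an orthorectified image could be lifted to 3D positions. The conventions are assumptions, not taken from the paper: the origin sits at the image centre, x points east, y points north, image rows run southward, and `metres_per_pixel` is a hypothetical ground sampling distance. The actual model predicts full Gaussian attributes, not just positions.

```python
def lift_height_map(height_map, metres_per_pixel):
    """Turn an orthographic height map into (east, north, up) points.

    height_map: list of rows, each a list of heights in metres
    metres_per_pixel: assumed ground sampling distance of the satellite tile
    """
    rows = len(height_map)
    cols = len(height_map[0])
    centre_col = (cols - 1) / 2.0
    centre_row = (rows - 1) / 2.0
    points = []
    for r, row in enumerate(height_map):
        for c, height in enumerate(row):
            east = (c - centre_col) * metres_per_pixel
            north = (centre_row - r) * metres_per_pixel  # row index grows southward
            points.append((east, north, height))
    return points


# Toy 3x3 height map at 2 m per pixel.
toy_heights = [
    [1.0, 1.0, 1.0],
    [1.0, 4.0, 1.0],
    [0.0, 0.0, 0.0],
]
satellite_points = lift_height_map(toy_heights, metres_per_pixel=2.0)
```

Because the image is orthorectified, the horizontal position of each point depends only on its pixel index, so the height is the only quantity that has to be predicted per pixel under these assumptions.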

Results


Benchmark. To enable a fair comparison between prior baselines and Cross-View Splatter, we manually geolocalize 10 scenes from Tanks and Temples and 40 from DL3DV by aligning each COLMAP reconstruction to publicly available satellite imagery in a shared coordinate frame. The full list of scenes, alignment data, and instructions for fetching the corresponding satellite tiles will be available in our GitHub repository.
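Structure-from-motion reconstructions are only defined up to a similarity transform, so aligning one to a georeferenced frame generally means applying a scale, a rotation, and a translation. The sketch below shows that operation on a few points. The format of the released alignment data is not specified here, so the function names, the yaw-only rotation helper, and the example numbers are purely illustrative assumptions.

```python
import math


def apply_similarity(points, scale, rotation, translation):
    """Map each 3D point p to scale * (rotation @ p) + translation.

    rotation is a 3x3 matrix given as nested lists; all parameters are
    hypothetical stand-ins for whatever the alignment data provides.
    """
    aligned = []
    for p in points:
        rotated = [sum(rotation[i][j] * p[j] for j in range(3)) for i in range(3)]
        aligned.append(tuple(scale * rotated[i] + translation[i] for i in range(3)))
    return aligned


def yaw_rotation(angle_rad):
    """Rotation about the vertical (z) axis by angle_rad."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


# Toy example: rotate 90 degrees about z, double the scale, then shift.
colmap_points = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.5)]
aligned_points = apply_similarity(
    colmap_points,
    scale=2.0,
    rotation=yaw_rotation(math.pi / 2),
    translation=(10.0, 20.0, 0.0),
)
```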

Sparse-view novel-view synthesis on our georeferenced Tanks & Temples and DL3DV benchmarks (2 context views). Our combined ground + satellite reconstruction achieves the highest PSNR on both datasets and the second-best LPIPS after AnySplat. See the paper for the full 1-, 2-, and 3-context-view tables and per-scene breakdowns.

Method                        T&T PSNR ↑   T&T LPIPS ↓   DL3DV PSNR ↑   DL3DV LPIPS ↓
MVSplat                       6.93         0.6997        6.27           0.7174
DepthSplat                    9.61         0.6077        8.58           0.6774
NoPoSplat                     8.97         0.6830        11.01          0.6665
Long-LRM                      8.53         0.7054        9.74           0.6813
AnySplat                      9.85         0.5773        10.37          0.5702
Cross-View Splatter (Ours)    11.67        0.5984        12.10          0.5940

Where the satellite prior helps most.

Compared to the AnySplat baseline, we improve PSNR across all overlap bins, with the largest gains at low context-vs-target IoU, where the target view shares little visible content with the context views.

Stratified PSNR vs. context/target IoU on Tanks and Temples

PSNR vs. context/target IoU on geolocalized Tanks & Temples (5% bins).
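A stratified plot of this kind can be produced by grouping test views by their context/target IoU and averaging PSNR within each group. The sketch below assumes each sample is an `(iou, mse)` pair with images scaled to [0, 1] and a nonzero error; the function names and the sample values are made up for illustration and are not results from the paper.

```python
import math
from collections import defaultdict


def psnr(mse, max_value=1.0):
    """PSNR in dB for a mean squared error; mse must be greater than zero."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mean_psnr_by_iou_bin(samples, bin_width=0.05):
    """Average PSNR per IoU bin.

    samples: iterable of (iou, mse) pairs with iou in [0, 1]
    Returns a dict mapping each bin's lower edge to its mean PSNR.
    """
    n_bins = round(1.0 / bin_width)
    grouped = defaultdict(list)
    for iou, mse in samples:
        # The small epsilon guards against float error at bin boundaries;
        # an IoU of exactly 1.0 is placed in the last bin.
        index = min(int(iou / bin_width + 1e-9), n_bins - 1)
        grouped[index].append(psnr(mse))
    return {
        round(index * bin_width, 10): sum(values) / len(values)
        for index, values in sorted(grouped.items())
    }


# Made-up samples: two low-overlap views and one higher-overlap view.
toy_samples = [(0.02, 0.01), (0.04, 0.1), (0.31, 0.01)]
binned = mean_psnr_by_iou_bin(toy_samples)
```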

Acknowledgments


We thank Zawar Qureshi and Jakub Powierza for compute infrastructure support and Alan Paul for help in generating terrain data. MT, JK, and AS acknowledge funding from the Research Council of Finland (362408, 339730).

BibTeX


If you find this work useful for your research, please consider citing our paper:

@inproceedings{turkulainen2026crossviewsplatter,
    title     = {{Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images}},
    author    = {Turkulainen, Matias and Krishnan, Akshay and Aleotti, Filippo and Sayed, Mohamed and Garcia-Hernando, Guillermo and Kannala, Juho and Solin, Arno and Brostow, Gabriel and Turmukhambetov, Daniyar},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year      = {2026}
}