ACE-G

Improving Generalization of Scene Coordinate Regression Through Query Pre-Training

ICCV 2025

Leonard Bruns

Axel Barroso-Laguna

Tommaso Cavallari

Áron Monszpart

Sowmya Munukutla

Victor Adrian Prisacariu

Eric Brachmann


Abstract


Scene coordinate regression (SCR) has established itself as a promising learning-based approach to visual relocalization. After mere minutes of scene-specific training, SCR models estimate camera poses of query images with high accuracy. Still, SCR methods fall short of the generalization capabilities of more classical feature-matching approaches. When imaging conditions of query images, such as lighting or viewpoint, are too different from the training views, SCR models fail. Failing to generalize is an inherent limitation of previous SCR frameworks, since their training objective is to encode the training views in the weights of the coordinate regressor itself. The regressor essentially overfits to the training views, by design. We propose to separate the coordinate regressor and the map representation into a generic transformer and a scene-specific map code. This separation allows us to pre-train the transformer on tens of thousands of scenes. More importantly, it allows us to train the transformer to generalize from mapping images to unseen query images during pre-training. We demonstrate on multiple challenging relocalization datasets that our method, ACE-G, leads to significantly increased robustness while keeping the computational footprint attractive.



Scene-Agnostic Coordinate Regressor


Most scene coordinate regressors are lightweight networks trained entirely per scene. These networks map patch features to their corresponding 3D coordinates. Instead, we split the scene-specific regressor into a set of scene-specific map codes and a scene-agnostic coordinate regressor. The map codes are still found through backpropagation-based optimization, but the regressor can be pre-trained specifically to improve generalization.

Split scene-specific regressor up into scene-agnostic regressor and scene-specific map codes.
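
To make the split concrete, here is a minimal NumPy sketch. It is not the ACE-G architecture: it assumes a single cross-attention step in which patch features attend to the map codes, followed by a linear head that outputs 3D coordinates. All names, sizes, and the one-layer design are hypothetical. The only point it illustrates is that the regressor weights are shared across scenes while each scene has its own map codes.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT_DIM, CODE_DIM, NUM_CODES = 16, 16, 8  # hypothetical sizes

# Scene-agnostic weights, shared by every scene (these would be pre-trained).
shared = {
    "w_q": rng.normal(size=(FEAT_DIM, CODE_DIM)) * 0.1,
    "w_k": rng.normal(size=(CODE_DIM, CODE_DIM)) * 0.1,
    "w_v": rng.normal(size=(CODE_DIM, CODE_DIM)) * 0.1,
    "w_out": rng.normal(size=(CODE_DIM, 3)) * 0.1,
}

def regress_coordinates(patch_features, map_codes, weights):
    """Predict one 3D point per patch by attending from patches to map codes."""
    q = patch_features @ weights["w_q"]          # (P, D) queries from patches
    k = map_codes @ weights["w_k"]               # (M, D) keys from map codes
    v = map_codes @ weights["w_v"]               # (M, D) values from map codes
    logits = q @ k.T / np.sqrt(q.shape[1])       # (P, M) attention scores
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)
    return (attn @ v) @ weights["w_out"]         # (P, 3) scene coordinates

# Scene-specific part: one set of map codes per scene.
codes_scene_a = rng.normal(size=(NUM_CODES, CODE_DIM))
codes_scene_b = rng.normal(size=(NUM_CODES, CODE_DIM))

patches = rng.normal(size=(5, FEAT_DIM))
coords_a = regress_coordinates(patches, codes_scene_a, shared)
coords_b = regress_coordinates(patches, codes_scene_b, shared)
```

The same patch features yield different coordinates depending on which map codes are plugged in, while `shared` never changes between scenes.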

Query Pre-Training for Improved Generalization


We pre-train the scene-agnostic coordinate regressor by mimicking its later use case: first, find the map codes given a set of mapping images; then, estimate the scene coordinates for query images different from the mapping images. These query images can contain novel viewpoints or other scene changes such as lighting and object placement.

We implement this idea by alternating between mapping and query iterations across hundreds of scenes in parallel (three shown in the illustration below). During mapping iterations, the map codes and the regressor are optimized jointly. During query iterations, only the regressor is optimized. The query iterations explicitly force the network to generalize from the mapping data, which is now stored in the map codes, to query data.

Pre-training architecture split into mapping pre-training and query pre-training.
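
The alternating schedule can be sketched with a toy model. Everything below is an assumption made for illustration: a linear "regressor" `W`, a 3-vector "map code" per scene, synthetic data, a squared-error loss, and plain gradient descent. What it does mirror from the description above is which parameters are updated when: mapping iterations update both the regressor and the map codes, query iterations update the regressor only.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_SCENES, FEAT_DIM, LR = 3, 4, 0.05

# Synthetic ground truth, used only to generate toy data.
true_W = rng.normal(size=(FEAT_DIM, 3))
true_offsets = rng.normal(size=(NUM_SCENES, 3))

def make_batch(scene, is_query, n=32):
    """Toy features and target coordinates; query features are perturbed."""
    x = rng.normal(size=(n, FEAT_DIM))
    y = x @ true_W + true_offsets[scene]
    if is_query:
        # Stand-in for changed imaging conditions in query images.
        x = x + 0.05 * rng.normal(size=x.shape)
    return x, y

def train_step(W, codes, scene, batch, update_codes):
    """One gradient step; the map code is only touched if update_codes is set."""
    x, y = batch
    residual = x @ W + codes[scene] - y
    loss = float(np.mean(np.sum(residual**2, axis=1)))
    grad_W = 2.0 * x.T @ residual / len(x)
    grad_code = 2.0 * residual.mean(axis=0)
    W -= LR * grad_W                      # regressor: updated in every iteration
    if update_codes:
        codes[scene] -= LR * grad_code    # map code: mapping iterations only
    return loss

W = np.zeros((FEAT_DIM, 3))           # shared toy regressor
codes = np.zeros((NUM_SCENES, 3))     # one toy map code per scene
query_losses = []
for it in range(400):
    is_mapping = it % 2 == 0          # alternate mapping and query iterations
    for scene in range(NUM_SCENES):
        batch = make_batch(scene, is_query=not is_mapping)
        loss = train_step(W, codes, scene, batch, update_codes=is_mapping)
        if not is_mapping:
            query_losses.append(loss)
```

Because the map codes are frozen during query iterations, any reduction in query loss during those steps has to come from the shared regressor.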

Mapping


Once the scene-agnostic regressor is pre-trained, we can find the map codes for new scenes through optimization. Starting from randomly initialized map codes, we minimize the negative log-likelihood of the reprojection error after projecting the predicted coordinates to the image plane. While we currently pre-train with 3D supervision, our method does not require 3D information at mapping time for new scenes.
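
As a rough illustration of such a loss, the sketch below projects predicted scene coordinates with a pinhole camera and scores the pixel error. The text above does not state which likelihood is used; a Laplace distribution with a per-point predicted scale is assumed here purely for illustration. The function names are hypothetical, and the camera pose and intrinsics of the mapping image are assumed to be known.

```python
import numpy as np

def project(points_world, R, t, K):
    """Project world points (N, 3) to pixels (N, 2) with a pinhole camera."""
    cam = points_world @ R.T + t   # world frame -> camera frame
    uv = cam @ K.T                 # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]  # perspective division

def reprojection_nll(points_world, scale, pixels, R, t, K):
    """Mean Laplace negative log-likelihood of the reprojection error (pixels)."""
    error = np.linalg.norm(project(points_world, R, t, K) - pixels, axis=1)
    return float(np.mean(np.log(2.0 * scale) + error / scale))

# Toy example: identity pose, two predicted points, unit uncertainty.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
points = np.array([[0.0, 0.0, 2.0],
                   [0.4, -0.2, 4.0]])
scale = np.ones(2)

pixels = project(points, R, t, K)
nll_exact = reprojection_nll(points, scale, pixels, R, t, K)
nll_shifted = reprojection_nll(points + np.array([0.1, 0.0, 0.0]),
                               scale, pixels, R, t, K)
```

Note that this loss only compares projected predictions against 2D pixel positions, so it needs no ground-truth 3D coordinates, in line with the statement that mapping a new scene does not require 3D information.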


Relocalization


Once the map codes for a new scene are optimized, we can relocalize query images within the scene. We use the estimated uncertainty to prefilter the estimated 2D-3D correspondences before finding the pose using a robust PnP solver.

To relocalize an image, we first estimate the 2D-3D correspondences, prefilter them based on the estimated uncertainty, and use a PnP solver to find its pose.
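
The prefiltering step might look like the following sketch. The selection rule (keep a fixed fraction of the lowest-uncertainty correspondences), the function name, and the parameters are assumptions, not the authors' procedure. The filtered correspondences would then be handed to a robust PnP solver, for example OpenCV's `cv2.solvePnPRansac`, which is not called here.

```python
import numpy as np

def prefilter_correspondences(pixels, points_3d, uncertainty,
                              keep_ratio=0.5, min_keep=4):
    """Keep the most confident 2D-3D correspondences (lowest uncertainty)."""
    num_keep = max(min_keep, int(round(keep_ratio * len(uncertainty))))
    num_keep = min(num_keep, len(uncertainty))
    order = np.argsort(uncertainty)[:num_keep]  # most confident first
    return pixels[order], points_3d[order]

# Toy correspondences: 8 pixels, 8 predicted 3D points, 8 uncertainties.
pixels = np.arange(16, dtype=float).reshape(8, 2)
points_3d = np.arange(24, dtype=float).reshape(8, 3)
uncertainty = np.array([0.1, 5.0, 0.2, 3.0, 0.3, 0.05, 9.0, 0.4])

kept_pixels, kept_points = prefilter_correspondences(pixels, points_3d, uncertainty)
```

Discarding uncertain correspondences before pose estimation reduces the number of outliers the robust solver has to reject.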

Below we visualize the estimated scene coordinates of ACE-G for different scenes. In addition, we show the ground-truth camera pose, and the estimates of ACE-G and ACE.

Datasets shown: Indoor-6, RIO10

BibTeX


If you find this work useful for your research, please consider citing our paper:

@inproceedings{bruns2025aceg,
  title={{ACE-G}: Improving Generalization of Scene Coordinate Regression Through Query Pre-Training},
  author={Bruns, Leonard and Barroso-Laguna, Axel and Cavallari, Tommaso and Monszpart, {\'{A}}ron and Munukutla, Sowmya and Prisacariu, Victor and Brachmann, Eric},
  booktitle={arXiv preprint arXiv:2510.11605},
  year={2025}
}