Visually localizing an image, i.e., estimating its camera pose, requires building a scene representation that serves as a visual map. The representation we choose has direct consequences for the practicality of our system. Even when starting from mapping images with known camera poses, state-of-the-art approaches still require hours of mapping time in the worst case, and several minutes in the best. This work raises the question of whether we can achieve competitive accuracy much faster. We introduce FastForward, a method that creates a map representation and relocalizes a query image on the fly in a single feed-forward pass. At its core, we represent multiple mapping images as a collection of features anchored in 3D space. FastForward uses these mapping features to predict image-to-scene correspondences for the query image, enabling the estimation of its camera pose. We couple FastForward with image retrieval and achieve state-of-the-art accuracy compared to other approaches with minimal map preparation time. Furthermore, FastForward generalizes robustly to unseen domains, including challenging large-scale outdoor environments.
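For context, the final step above, recovering a camera pose from image-to-scene correspondences, is commonly solved with a PnP solver inside a RANSAC loop. The sketch below illustrates that standard step with OpenCV; the function name, array shapes, and solver settings are illustrative assumptions, not FastForward's exact implementation.

```python
# Minimal sketch: estimating a camera pose from predicted image-to-scene
# correspondences with a standard PnP-RANSAC solver (OpenCV). Names, shapes,
# and thresholds are illustrative placeholders, not FastForward's actual API.
import cv2
import numpy as np

def pose_from_correspondences(pts_2d, pts_3d, K):
    """Estimate the query pose from 2D-3D matches.

    pts_2d: (N, 2) pixel coordinates in the query image.
    pts_3d: (N, 3) predicted scene coordinates in map space.
    K:      (3, 3) camera intrinsics of the query image.
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts_3d.astype(np.float64),
        pts_2d.astype(np.float64),
        K.astype(np.float64),
        distCoeffs=None,
        reprojectionError=3.0,   # inlier threshold in pixels (assumed)
        iterationsCount=1000,
        flags=cv2.SOLVEPNP_ITERATIVE,
    )
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)   # rotation vector -> 3x3 rotation matrix
    return R, tvec, inliers
```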
We introduce FastForward, a network that predicts query coordinates in 3D scene space relative to a collection of mapping images with known poses. FastForward represents the scene as a random set of features sampled from the mapping images, and returns the estimate for a query w.r.t. all mapping images in a single feed-forward pass. In the figure below, we show how results improve as FastForward uses an increasing number of mapping images, as returned by image retrieval. Note that we can sample the same number of mapping features in each case; hence, FastForward's query runtime and GPU memory demand remain roughly constant across all three examples.
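To illustrate why the compute budget stays flat, here is a minimal sketch of fixed-budget feature sampling, assuming per-image patch features are already extracted; the function name, shapes, and budget are hypothetical choices, not the paper's exact implementation.

```python
# Minimal sketch of the fixed-budget sampling idea: however many mapping
# images retrieval returns, we draw the same total number of features.
import torch

def sample_map_features(per_image_feats, budget=2048, generator=None):
    """per_image_feats: list of (N_i, D) feature tensors, one per mapping image.

    Returns a (budget, D) tensor sampled uniformly from the pooled features.
    """
    pooled = torch.cat(per_image_feats, dim=0)                        # (sum N_i, D)
    idx = torch.randperm(pooled.shape[0], generator=generator)[:budget]
    return pooled[idx]
```

Whether retrieval returns one mapping image or twenty, the resulting map tensor keeps the same shape, so the downstream network always sees a fixed-size input.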
FastForward uses a ViT feature extractor to compute features from the query and mapping images. Mapping features and camera pose information are used to generate the map representation of the scene. For each mapping feature used in the map, we define a ray embedding that encodes the corresponding camera's position and viewing direction. Visual features and ray embeddings are fused into a single feature vector. The map representation is created by randomly sampling a fixed number of mapping features. A Transformer block and a DPT head then predict the 3D query points in the map coordinate system. Finally, the resulting 2D-3D correspondences are used to estimate the query pose.
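The sketch below illustrates the anchoring and fusion step described above: each mapping feature is paired with a ray embedding built from its camera's center and the viewing direction through its patch, and both are fused into a single vector. The 6-dimensional ray parameterization and the linear fusion layer are simplifying assumptions; FastForward's actual embedding and fusion may differ.

```python
# Sketch of anchoring mapping features in 3D: fuse each ViT patch feature
# with a ray embedding derived from its source camera's pose (illustrative
# assumption, not the exact FastForward architecture).
import torch
import torch.nn as nn

class RayFusion(nn.Module):
    def __init__(self, feat_dim, ray_dim=6):
        super().__init__()
        # Project the concatenated [visual feature | ray embedding] back to feat_dim.
        self.fuse = nn.Linear(feat_dim + ray_dim, feat_dim)

    def forward(self, feats, cam_centers, view_dirs):
        """feats:        (N, D) patch features from the mapping images.
        cam_centers:     (N, 3) camera position for each feature's source image.
        view_dirs:       (N, 3) unit viewing ray through each patch center.
        """
        rays = torch.cat([cam_centers, view_dirs], dim=-1)   # (N, 6)
        return self.fuse(torch.cat([feats, rays], dim=-1))   # (N, D)
```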
We show examples of FastForward performing mapping and localization of query images in a single forward pass. Instead of using all available mapping images, we represent the scene by selecting the top K images returned by a retrieval step. Outdoor examples use 20 mapping images, while indoor examples use 10. From each mapping image we sample 20% of its features. We show the predicted query pose in blue and the ground-truth pose in green. We also show the camera trajectories of the mapping images in gray, and additionally display the camera frustums of the mapping images used in the prediction.
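The top-K selection mentioned above can be implemented with any off-the-shelf image retrieval model. The following sketch assumes precomputed global descriptors and ranks mapping images by cosine similarity; the function and variable names are hypothetical.

```python
# Sketch of retrieval-based top-K selection over precomputed global
# descriptors (illustrative; the descriptor extractor is left abstract).
import torch

def topk_mapping_images(query_desc, map_descs, k=20):
    """query_desc: (D,) global descriptor; map_descs: (M, D) descriptors."""
    q = query_desc / query_desc.norm()
    m = map_descs / map_descs.norm(dim=-1, keepdim=True)
    scores = m @ q                                   # (M,) cosine similarities
    return torch.topk(scores, k=min(k, m.shape[0])).indices
```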
We evaluate FastForward against the baseline methods Reloc3r and MASt3R in extreme visual localization scenarios. For the Wayspots scenes, we only display Reloc3r estimates, as MASt3R was trained on this dataset. In the Wayspots scenes, we highlight two challenging scenarios: opposing shots (i.e., mapping and query scans taken from opposite viewpoints) and symmetric scenes (i.e., scenes that look similar from different viewpoints). We also present results on the Cambridge dataset to demonstrate how the methods generalize to scenes with scale ranges unseen during training.
If you find this work useful for your research, please consider citing our paper:
@inproceedings{barroso2025fastforward,
  title     = {A Scene is Worth a Thousand Features: Feed-Forward Camera Localization from a Collection of Image Features},
  author    = {Barroso-Laguna, Axel and Cavallari, Tommaso and Prisacariu, Victor and Brachmann, Eric},
  booktitle = {arXiv},
  year      = {2025}
}