Doppelgangers++: Improved Visual Disambiguation with Geometric 3D Features

1Cornell University, 2Visym Labs

Visual aliasing, or doppelgangers, poses severe challenges to 3D reconstruction. We propose Doppelgangers++, an enhanced pairwise image classifier that excels at visual disambiguation across diverse and challenging scenes. (Top) We seamlessly integrate Doppelgangers++ into SfM, successfully disambiguating each scene. (Middle) Compared to prior work (which we refer to as DG-OG), Doppelgangers++ is more accurate and robust on everyday scenes; we show pairs that DG-OG classifies incorrectly but our method classifies correctly. (Bottom) Our new VisymScenes dataset, featuring complex everyday scenes, is particularly challenging for COLMAP and DG-OG, but our method achieves correct and complete reconstructions.

Abstract

Accurate 3D reconstruction is frequently hindered by visual aliasing, where visually similar but distinct surfaces (a.k.a. doppelgangers) are incorrectly matched. These spurious matches distort the structure-from-motion (SfM) process, leading to misplaced model elements and reduced accuracy. Prior efforts addressed this with CNN classifiers trained on curated datasets, but these approaches struggle to generalize across diverse real-world scenes and can require extensive parameter tuning. In this work, we present Doppelgangers++, a method to enhance doppelganger detection and improve 3D reconstruction accuracy. Our contributions include a diversified training dataset that incorporates geo-tagged images from everyday scenes to expand robustness beyond landmark-based datasets. We further propose a Transformer-based classifier that leverages 3D-aware features from the MASt3R model, achieving superior precision and recall on both in-domain and out-of-domain tests. Doppelgangers++ integrates seamlessly into standard SfM and MASt3R-SfM pipelines, offering efficiency and adaptability across varied scenes. To evaluate SfM accuracy, we introduce an automated, geotag-based method for validating reconstructed models, eliminating the need for manual inspection. Through extensive experiments, we demonstrate that Doppelgangers++ significantly enhances pairwise visual disambiguation and improves 3D reconstruction quality in complex and diverse scenarios.

SfM Results Comparison

Videos of COLMAP reconstruction results (top) and MASt3R-SfM with Doppelgangers++ results (bottom). Examples are from the MegaScenes dataset. The COLMAP results are broken due to doppelganger issues, with collapsed structures and incorrect geometry; our method successfully disambiguates the scenes, producing more accurate and complete reconstructions. (Please expand the videos for better viewing.) Click the links below the videos to view the 3D models in our web viewer.

The VisymScenes Dataset

We introduce a new dataset spanning residential areas, landmarks, historical sites, business districts, and more. It contains 258K images with GPS/IMU metadata, recorded at 149 sites across 42 cities in 15 countries, and includes everyday rural and suburban scenes designed to complement the notable landmarks available on Wikimedia Commons.

We mine doppelganger and true matching pairs from the VisymScenes dataset by leveraging their recorded geolocations and viewing directions. Here are four examples. The top row shows subsets of images captured within each site. The bottom row displays pairs of visually similar but geographically distinct images from each site, along with their recorded geolocations on a map. These examples demonstrate that doppelganger issues are prevalent in everyday scenes, posing significant challenges for reliable 3D reconstruction and image matching.
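The geotag-based mining step described above can be sketched as follows. This is a minimal illustration, not the dataset's actual mining pipeline: the distance thresholds, field names, and the upstream visual-similarity signal are all assumptions.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two GPS points."""
    R = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

def label_pair(img_a, img_b, visually_similar,
               pos_thresh_m=10.0, neg_thresh_m=50.0):
    """Label an image pair using recorded geotags.

    img_* are dicts with 'lat'/'lon' keys; `visually_similar` is assumed to
    come from an upstream retrieval/matching stage (not shown). Thresholds
    are illustrative, not the paper's values.
    """
    if not visually_similar:
        return 'unused'
    d = haversine_m(img_a['lat'], img_a['lon'], img_b['lat'], img_b['lon'])
    if d <= pos_thresh_m:
        return 'true_match'    # visually similar and co-located
    if d >= neg_thresh_m:
        return 'doppelganger'  # visually similar but geographically distinct
    return 'ambiguous'         # too close to call; skip during mining
```

In this sketch, viewing directions could further refine the positive labels (e.g., requiring overlapping frusta for true matches), which the dataset's recorded IMU metadata would support.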

Improved Doppelganger Classifier

Model design. (Left) Given an image pair, we first create a symmetrized version of the pair and feed it into the frozen MASt3R model. Multi-layer features are extracted from each decoder branch, concatenated, and fed into two learnable doppelganger classification heads. Each head generates predictions supervised by cross-entropy loss. (Right) We use multi-layer decoder features and a Transformer-based classifier head for doppelganger prediction.
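A minimal PyTorch sketch of such a Transformer-based classification head operating on frozen MASt3R decoder features. The feature dimension, layer count, and CLS-token pooling here are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DoppelgangerHead(nn.Module):
    """Transformer classifier head over (frozen) decoder feature tokens.

    Sketch only: feat_dim, depth, and CLS pooling are assumed values.
    """
    def __init__(self, feat_dim=768, n_layers=2, n_heads=8):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.fc = nn.Linear(feat_dim, 2)  # logits: [true match, doppelganger]

    def forward(self, tokens):
        # tokens: (B, N, feat_dim) -- concatenated multi-layer decoder features
        b = tokens.shape[0]
        x = torch.cat([self.cls_token.expand(b, -1, -1), tokens], dim=1)
        x = self.encoder(x)
        return self.fc(x[:, 0])  # classify from the prepended CLS token

head = DoppelgangerHead()
logits = head(torch.randn(2, 196, 768))  # e.g. 14x14 patch tokens per pair
```

In the full model, one such head would sit on each decoder branch of the symmetrized pair, with both heads supervised by cross-entropy loss as described above.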

Assessing Doppelganger Correction in SfM

There is currently no reliable benchmark for evaluating how well SfM reconstructions handle the doppelganger issue. We propose to leverage mapping sites like Mapillary, which provide images with location metadata that can serve as probes for validating a 3D model. We first collect sequences of geo-tagged Mapillary images around the target location and register them to the SfM model. Then, we use RANSAC to align the registered cameras with their geotags. The inlier ratio (IR) is computed as an indicator of model accuracy. (Bottom) In the model corrupted by doppelganger pairs, the registered cameras all collapse to one side. The camera poses estimated with COLMAP (right, in red) do not align well with the geotags (green), leading to a low inlier ratio, whereas our method yields a much closer alignment.
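The RANSAC alignment and inlier-ratio computation can be sketched as below, assuming camera positions and geotags are both given as 2D points in a local metric frame. The similarity fit follows the standard closed-form (Umeyama-style) least-squares solution; the thresholds and iteration count are illustrative, not the paper's settings.

```python
import numpy as np

def fit_similarity_2d(src, dst):
    """Least-squares 2D similarity (scale, rotation, translation)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    s, d = src - mu_s, dst - mu_d
    cov = d.T @ s / len(src)
    U, S, Vt = np.linalg.svd(cov)
    D = np.diag([1.0, np.sign(np.linalg.det(U @ Vt))])  # guard reflections
    R = U @ D @ Vt
    scale = np.trace(np.diag(S) @ D) / s.var(0).sum()
    t = mu_d - scale * R @ mu_s
    return scale, R, t

def inlier_ratio(cams, geotags, thresh=5.0, iters=200, seed=0):
    """RANSAC over minimal 3-point samples: align registered SfM camera
    positions to geotag positions and report the best inlier ratio."""
    rng = np.random.default_rng(seed)
    best = 0
    for _ in range(iters):
        idx = rng.choice(len(cams), 3, replace=False)
        s, R, t = fit_similarity_2d(cams[idx], geotags[idx])
        pred = (s * (R @ cams.T)).T + t
        inliers = np.linalg.norm(pred - geotags, axis=1) < thresh
        best = max(best, int(inliers.sum()))
    return best / len(cams)
```

A model corrupted by doppelgangers places many probe cameras far from their geotags, so no similarity transform can bring them into agreement and the inlier ratio stays low.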

Evaluation on Pairwise Visual Disambiguation

We evaluate DG-OG and our method, each trained on DG only and on DG+VisymScenes (two numbers per cell), on three test sets. Both DG-OG and our method benefit from the dataset expansion, but ours gains more generalization on the out-of-domain test set (Mapillary) after training on both. Our classifier consistently demonstrates better precision and recall across all settings.

Evaluation on SfM Reconstruction Disambiguation

We compare reconstructions from COLMAP, DG-OG, and our method. τ=0.8 is used unless otherwise stated. The '+' symbol indicates split reconstruction components; e.g., DG-OG and our method split the Radcliffe Camera reconstruction into two components. Because VisymScenes scenes are large, we report statistics on the largest reconstruction component produced by COLMAP, and identify the corresponding (split) components in the results of DG-OG and ours.

SfM disambiguation on MegaScenes

DG-OG fails to disambiguate this scene, predicting incorrect scores for image pairs, while our method correctly splits the model into two clean components. (White background) SfM results from DG-OG and ours. (Black background) Verification using geo-tagged images: red points represent registered cameras, green points represent geotags, and the inlier ratio (IR) is labeled at the bottom right.

SfM disambiguation on VisymScenes

Our classifier is more robust than DG-OG on test scenes from new domains, such as everyday street scenes. DG-OG has difficulty disambiguating such scenes, leading to incorrect geometry and entangled components.

Combine with MASt3R-SfM

While MASt3R's features are powerful, MASt3R-SfM is not free from doppelganger issues. Although our classifier was trained on image pairs mined through COLMAP's feature matching module (i.e., with SIFT features), it effectively prunes incorrect matches generated by MASt3R, restoring accurate reconstruction results.
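Pruning the match graph with the classifier can be sketched as follows. The function names and score representation are hypothetical; τ=0.8 is the threshold used in the SfM experiments. After pruning, a scene containing doppelgangers may correctly split into several connected components.

```python
def prune_match_graph(edges, scores, tau=0.8):
    """Keep only pairs the classifier judges to be true matches.

    edges: list of (i, j) image-id pairs from the matching stage;
    scores: dict mapping each pair to its true-match probability.
    """
    return [e for e in edges if scores[e] >= tau]

def components(num_images, edges):
    """Union-find over the pruned match graph; each resulting component
    is reconstructed separately by the SfM backend."""
    parent = list(range(num_images))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    groups = {}
    for i in range(num_images):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

In this sketch, dropping a low-scoring edge such as one bridging two doppelganger facades disconnects the graph, so the two structures are reconstructed as separate, clean models instead of being fused.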

Acknowledgment

We thank Joseph Tung and Brandon Li for their valuable discussions and help with the web viewer. This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DOI/IBC) contract number 140D0423C0035. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/IBC, or the U.S. Government. This research was also supported in part by the National Science Foundation under award IIS-2212084.