Structure-from-Motion (SfM)

Structure-from-motion (SfM) is the process of reconstructing a 3D structure from 2D images captured from different viewpoints. The fundamental principle behind SfM is about finding correspondences between images and performing incremental reconstruction to generate a sparse point cloud and an estimation of the camera poses for the captured imagery.  COLMAP, the state-of-the-art open source implementation [1] is built based on interesting techniques for feature extraction, triangulation and handling of spurious features to build a point cloud from 2D images. Understanding how these techniques help robots to reconstruct is important for applications across different domains.

COLMAP is built as a feature-based reconstruction platform. The initial stage of visual SfM is to perform correspondence search which includes feature extraction, matching and geometric verification. This step is similar to feature-based robotics applications such as ORB-SLAM [2]. Given direct methods rely on intensity of images, feature-based reconstructions are reliable for robotic applications with higher overhead expenses during deployment.

Correspondence Search

​​COLMAP implements correspondences among images by extracting a set of local keypoints using scale-invariant feature transform (SIFT). This allows robustness against geometric changes, and provides a degree of acceptance of noise among images. Once the features are extracted, COLMAP allows you to perform different kinds of matching from exhaustive matching, where every image is matched against every other image to sequential matching in which adjacent image pairs are matched. In cases where the images have change in intensity among them or higher noise, there can be spurious features. These spurious features refer to incorrect correspondences. Projective transformation is estimated between images to geometrically verify the features and such geometric relations between the cameras are described by homography. COLMAP also uses random sample consensus (RANSAC) [3] to eliminate outlier-contaminated correspondences.

Incremental Reconstruction

COLMAP performs incremental reconstruction following correspondence search, to generate pose estimation for registered images and a point cloud of the captured imaging sequence. Initialization is one of the most important factors in incremental reconstruction as reconstruction might never recover from a bad initialization. Dense location in an image graph with many overlapping cameras is ideal for robust reconstruction, adding required redundancy.

Following initialization of the model, new images are registered to the model by known 2D-3D correspondences between the features from new images and reconstructed 3D points associated with the features of the previously registered images. This is implemented by solving the Perspective-n-Point (PnP) problem which involves estimation of poses and intrinsic parameters. Since a newly registered image may observe new points and existing reconstructed 3D points, COLMAP performs triangulation such that new points are added to the 3D reconstruction only if they have been observed by at least a single previously registered image. This adds more stability to the model via redundancy.

Bundle adjustment [4] jointly refines the 3D reconstructed points and camera parameters by minimising reprojection error, accounting for the triangulation errors. Reprojection error is calculated between a point in image space projected from 3D reconstruction and the corresponding keypoint used in triangulation. This problem is a highly non-linear, non-convex optimization problem and is solved using Levenberg-Marquardt. The special structure of parameters in bundle adjustment problem also motivates the use of Schur complement trick, in which the reduced camera system is solved first and points are updated via back-substitution. This scheme is comparingly more efficient, since the number of cameras is usually smaller than the number of points.

Current state-of-the-art SfM algorithms can handle a diverse and complex range of large-scale internet photo collections but frequently fail to produce complete robust reconstruction. COLMAP addresses this challenge and significantly improves the results over the current state of the art by introducing a geometric verification strategy that augments scene graph with  information, next best view selection, robust triangulation method and interactive bundle adjustment, re-triangulation and outlier filtering strategies. 

In the context of robotic vision, where robots explore unknown environments with higher degree of uncertainties, correspondence search fails at situations where SIFT feature detection fails to produce true robust features. However, interesting imaging techniques such as LiFF [5] can be used as an altering layer and/or BurstSfM [6] can be used as additional later on improving robotic vision with COLMAP.


  1. Schonberger, J.L. and Frahm, J.M., 2016. Structure-from-motion revisited. Computer vision and pattern recognition pp. 4104-4113.
  2. Mur-Artal, R., Montiel, J.M.M. and Tardos, J.D., 2015. ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE transactions on robotics31(5), pp. 1147-1163.
  3. Fischler, M.A. and Bolles, R.C., 1981. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM24(6), pp.381-395.
  4. Triggs, B., McLauchlan, P.F., Hartley, R.I. and Fitzgibbon, A.W., 1999. Bundle adjustment—a modern synthesis. International workshop on vision algorithms, pp. 298-372.
  5. Dansereau, D.G., Girod, B. and Wetzstein, G., 2019. LiFF: Light field features in scale and depth. Computer Vision and Pattern Recognition pp. 8042-8051.
  6. Ravendran, A., Bryson, M. and Dansereau, D.G., 2021. Burst Imaging for Light-Constrained Structure-From-Motion. IEEE Robotics and Automation Letters7(2), pp.1040-1047.
%d bloggers like this: