In computer vision, what does MVS do that SFM can't?

Tags:

structure-from-motion

I'm a dev with about a decade of enterprise software engineering under my belt, and my hobbyist interests have steered me into the vast and scary realm of computer vision (CV).

One thing that is not immediately clear to me is the division of labor between Structure from Motion (SFM) tools and Multi View Stereo (MVS) tools.

Specifically, CMVS appears to be the best-in-show MVS tool, and Bundler seems to be one of the better open source SFM tools out there.

Taken from CMVS's own homepage:

You should ALWAYS use CMVS after Bundler and before PMVS2

I'm wondering: why?!? My understanding of SFM tools is that they perform the 3D reconstruction for you, so why do we need MVS tools in the first place? What value/processing/features do they add that SFM tools like Bundler can't address? Why the proposed pipeline of:

Bundler -> CMVS -> PMVS2


asked Aug 30 '16 02:08

smeeb

1 Answer

In short, Structure from Motion (SfM) and MultiView Stereo (MVS) techniques are complementary, as they work under different assumptions. They also differ in their inputs: MVS requires camera parameters to run, and those are estimated (output) by SfM. SfM only gives a coarse 3D output, whereas PMVS2 gives a denser output, and CMVS is there to circumvent some limitations of PMVS2.

The rest of the answer provides a high-level overview of how each method works, explaining why the pipeline is built this way.

Structure from Motion

The first step of the 3D reconstruction pipeline you highlighted is an SfM algorithm, which could be run using Bundler, VisualSFM, OpenMVG or the like. This algorithm takes some images as input and outputs the camera parameters of each image (more on this later) as well as a coarse 3D shape of the scene, often called the sparse reconstruction.

Why does SfM output only a coarse 3D shape? Basically, SfM techniques begin by detecting 2D features in every input image and matching those features between pairs of images. The goal is, for example, to tell "this table corner is located at these pixel locations in these images." Those features are described by what we call descriptors (like SIFT or ORB). Descriptors are built to represent a small region (i.e., a bunch of neighboring pixels) of an image. They can reliably represent highly textured areas or rough geometry (e.g., edges), but these scene features need to be unique (in the sense of distinguishing) throughout the scene to be useful. For example (maybe oversimplified), a wall with a repetitive pattern would not be very useful for the reconstruction: even though it is highly textured, every region of the wall could potentially match pretty much everywhere else on the wall.

Since SfM performs its 3D reconstruction using those features, the points of the reconstructed scene will be located on those unique textures or edges, giving a sparse set of 3D points rather than a dense surface. SfM typically won't produce a point in the middle of a surface that lacks precise and distinguishing texture.

However, when many matches are found between two images, one can compute a 3D transformation between them, effectively giving the relative position and orientation of the two cameras.
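To make the "unique features" point concrete, here is a minimal sketch of descriptor matching with a nearest-neighbor ratio test, a common way to discard ambiguous matches. It is not taken from Bundler; the 3-element descriptors, the 0.75 threshold, and the function names are assumptions chosen for illustration (real SIFT descriptors have 128 dimensions).

```python
import math


def l2(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ratio_test_matches(desc_a, desc_b, ratio=0.75):
    """Return (i, j) pairs where desc_a[i] has a clearly best match desc_b[j]."""
    matches = []
    for i, d in enumerate(desc_a):
        # Rank the candidates of image B by distance to descriptor d.
        ranked = sorted(range(len(desc_b)), key=lambda j: l2(d, desc_b[j]))
        if len(ranked) < 2:
            continue
        best, second = ranked[0], ranked[1]
        # Keep the match only if the best candidate is much closer
        # than the runner-up; otherwise the feature is ambiguous.
        if l2(d, desc_b[best]) < ratio * l2(d, desc_b[second]):
            matches.append((i, best))
    return matches


# Toy descriptors (hypothetical values).
corner = [0.9, 0.1, 0.3]  # distinctive table corner
brick = [0.5, 0.5, 0.5]   # one brick of a repetitive wall

image_a = [corner, brick]
image_b = [
    [0.88, 0.12, 0.31],  # the same corner seen from another viewpoint
    [0.51, 0.49, 0.50],  # brick 1
    [0.49, 0.51, 0.50],  # brick 2, nearly identical to brick 1
]

matches = ratio_test_matches(image_a, image_b)
print(matches)  # [(0, 0)]
```

The corner is matched, while the brick is rejected because two candidates are almost equally close. That is why repetitive or textureless surfaces end up with few or no points in the sparse reconstruction.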

MultiView Stereo

Afterwards, an MVS algorithm is used to densify the sparse output of the SfM step, resulting in what is called a dense reconstruction. This algorithm requires the camera parameters of each image, which are output by the SfM algorithm. Because it works on a more constrained problem (it already has the camera parameters of every image: position, rotation, focal length, etc.), MVS can compute 3D points in regions that were not (or could not be) correctly detected or matched by descriptors. This is what PMVS2 does.

How can PMVS2 work on regions where 2D feature descriptors would be hard to match? Since you know the camera parameters, you know that a given pixel in one image corresponds to a 3D ray, and that this ray projects to a line in another image. This constraint comes from epipolar geometry. Whereas SfM had to search the entire 2D image for every descriptor to find a potential match, MVS only searches along a single 1D line, which simplifies the problem a great deal. Working on this simpler problem, MVS usually takes illumination and object materials into account in its optimization, which SfM does not.
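The epipolar constraint can be sketched in a few lines. This is only an illustration, not PMVS2 code: it assumes two pinhole cameras sharing the same intrinsics (focal length f, principal point (cx, cy), square pixels, no skew), with camera 1 at the origin and camera 2 related to it by x_cam2 = R X + t. All numeric values and function names are made up for the example.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v == t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]


def fundamental_from_pose(f, cx, cy, R, t):
    """F = K^-T [t]_x R K^-1 for two cameras with identical intrinsics."""
    k_inv = [[1 / f, 0.0, -cx / f], [0.0, 1 / f, -cy / f], [0.0, 0.0, 1.0]]
    essential = matmul(skew(t), R)
    return matmul(matmul(transpose(k_inv), essential), k_inv)


def project(f, cx, cy, R, t, X):
    """Project the 3D point X to a pixel (u, v) in the camera [R | t]."""
    xc = [sum(R[i][k] * X[k] for k in range(3)) + t[i] for i in range(3)]
    return (f * xc[0] / xc[2] + cx, f * xc[1] / xc[2] + cy)


def epipolar_line(F, pixel):
    """Line (a, b, c) in image 2, with a*u + b*v + c = 0, for a pixel of image 1."""
    x = [pixel[0], pixel[1], 1.0]
    return [sum(F[i][k] * x[k] for k in range(3)) for i in range(3)]


def distance_to_line(line, pixel):
    """Distance in pixels from a pixel to the line (a, b, c)."""
    a, b, c = line
    return abs(a * pixel[0] + b * pixel[1] + c) / math.hypot(a, b)


f, cx, cy = 800.0, 320.0, 240.0
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t2 = [-0.5, 0.0, 0.0]  # camera 2 sits 0.5 units to the right of camera 1

F = fundamental_from_pose(f, cx, cy, identity, t2)

# Two 3D points on the same viewing ray of camera 1, at depths 4 and 8.
near, far = [0.2, -0.1, 4.0], [0.4, -0.2, 8.0]
pixel_1 = project(f, cx, cy, identity, [0.0, 0.0, 0.0], near)
line_2 = epipolar_line(F, pixel_1)

for X in (near, far):
    pixel_2 = project(f, cx, cy, identity, t2, X)
    print(pixel_2, distance_to_line(line_2, pixel_2))
```

Both depths project to the same pixel in image 1 but to different pixels in image 2, and both of those pixels lie on the same epipolar line. Searching for the right depth therefore amounts to sliding along that line and comparing image patches, which is the 1D search described above.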

There is one issue, though: PMVS2 performs a quite complex optimization that can be dreadfully slow or take an astronomical amount of memory on large image sets. This is where CMVS comes into play: it clusters the coarse 3D SfM output into regions. PMVS2 is then called (potentially in parallel) on each cluster, which keeps each run manageable. The per-cluster PMVS2 outputs are then merged into a unified detailed model.
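The divide, reconstruct, and merge idea can be sketched as follows. To be clear, this is not CMVS's clustering algorithm (which works from the SfM output): the fixed-size overlapping chunks, the dense_reconstruct stand-in, and every name below are hypothetical, and only the overall shape of the pipeline is illustrated.

```python
from concurrent.futures import ThreadPoolExecutor


def cluster_images(image_ids, max_size, overlap=1):
    """Split images into chunks of at most max_size that share `overlap` images."""
    if max_size <= overlap:
        raise ValueError("max_size must be larger than overlap")
    step = max_size - overlap
    clusters = []
    for start in range(0, len(image_ids), step):
        clusters.append(image_ids[start:start + max_size])
        if start + max_size >= len(image_ids):
            break  # the last chunk already reaches the end of the list
    return clusters


def dense_reconstruct(cluster):
    """Hypothetical stand-in for one dense run on a cluster.

    Returns one fake point label per pair of neighboring images.
    """
    return {f"point_from_{a}_{b}" for a, b in zip(cluster, cluster[1:])}


def reconstruct_all(image_ids, max_size=4):
    clusters = cluster_images(image_ids, max_size)
    # Each cluster is small enough to process on its own, so the runs
    # are independent and can execute in parallel.
    with ThreadPoolExecutor() as pool:
        partial_models = list(pool.map(dense_reconstruct, clusters))
    # Merge the per-cluster results into a single model.
    merged = set().union(*partial_models)
    return clusters, merged


clusters, merged = reconstruct_all(list(range(10)))
print(clusters)     # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
print(len(merged))  # 9
```

The overlap between neighboring clusters is what keeps the merged model from losing the regions that straddle a cluster boundary. Memory use is bounded by the largest cluster rather than by the whole image set.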

Conclusion

Most of the information provided in this answer, and much more, can be found in this tutorial from Yasutaka Furukawa, author of CMVS and PMVS2: http://www.cse.wustl.edu/~furukawa/papers/fnt_mvs.pdf

In essence, the two techniques come from different approaches: SfM aims to perform a 3D reconstruction from a structured (but unknown) sequence of images, while MVS is a generalization of two-view stereo vision, based on human stereopsis.


answered Oct 04 '22 20:10

Soravux
