
Questions about the Structure From Motion Pipeline

I've been trying to implement a simple SFM pipeline in OpenCV for a project and I'm having a bit of trouble.

It's for uncalibrated cameras so I don't have a camera matrix (Yes, I know it's going to make things much more complicated and ambiguous).
I know that I should be reading a lot more before attempting something like this, but I'm quite hard-pressed for time, so I'm trying to read about things as I come across them.

Here's my current pipeline, which I've gathered from a number of articles, code samples, and books. I've posted questions about specific steps below, and I'd also like to know: is there something I'm missing, or something I'm doing wrong?

  1. Extract SIFT/SURF Keypoints from the images.
  2. Pairwise Matching of Images.
    1. During Pairwise Matching I run the "Ratio Test" to reduce the number of keypoints.
    2. (Not sure about this) I read that estimating the Fundamental Matrix with RANSAC and discarding the outlier matches helps further.

      Q) Do I need to do this at all? Is it overkill, or should I be doing something else, like a homography check, to avoid the degenerate cases of the 8-point algorithm?

  3. Next, I need to choose 2 images to begin the reconstruction with.

    1. I find the number of Homography inliers between image pairs, then iterate through the pairs in descending order of inlier percentage.
    2. I calculate the Fundamental Matrix.
    3. I "guess" a K matrix and calculate the Essential Matrix with the formula from Hartley & Zisserman: E = Kᵀ F K.
    4. I decompose this Essential Matrix with SVD and then verify the 4 solutions.
      • I used the logic from Wikipedia's entry and a Python gist to implement my checks.

        Q) Is this right? Or should I just triangulate the points and then determine if they are in front of the camera or not or does it work out to the same thing?

    5. If there's a problem finding the Essential Matrix, skip this pair and check the next one.
  4. Set P=[I|0] and P1=[R|T], perform Triangulation and store the 3d points in some Data Structure. Also store the P matrices.

  5. Run a Bundle Adjustment Step with a large-ish number of iterations to minimize error.

    It gets a little hazy from here and I'm pretty sure I'm messing something up.

  6. Choose the next image to add based on which one observes the largest number of already-reconstructed 3D points.

  7. Estimate the pose of this new image from the already known 3D points using something like solvePnPRansac. Use the resulting R and t as its projection matrix P1=[R|t].
  8. Triangulate this new image against all (I know, I don't need to do it with ALL of them) the images triangulated so far, using their stored P matrices as P=PMatrices[ImageAlreadyTriangulated] and the P1 obtained above.

    Q) Is it really as simple as just using the original value of P we have used? Will that get everything into the same coordinate space? As in, will the triangulated points all be the same system as those obtained from the initial values of P and P1 or do I need to do some kind of transformation here?

  9. From the points we obtain from triangulation, only add those 3D points that we don't already have stored.

  10. Run a Bundle Adjustment step after every couple of images.
  11. Go back to step 6 until all images are added.

General questions:

  • Should I be using undistort for the points or something even though my camera matrix K is only a guess?
  • For bundle adjustment, I'm outputting the points to a file in the Bundle Adjustment in the Large (BAL) format. Should I be converting the cameras to world coordinates via R = Rᵀ and t = -Rᵀt, or just leave them be?
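To make the second question concrete: inverting a world-to-camera pose (R, t) gives orientation Rᵀ and camera centre C = -Rᵀt (note the transpose; -Rt is not the inverse). A small NumPy sketch of the conversion, with the caveat that whether it's needed depends on what the consumer of the file expects (BAL, as far as I know, stores the world-to-camera R and t directly):

```python
import numpy as np

def camera_to_world(R, t):
    """Invert a world-to-camera pose (x_cam = R @ x_world + t):
    returns the camera's orientation and centre in world coordinates."""
    R_world = R.T        # note the transpose
    C = -R.T @ t         # camera centre satisfies R @ C + t == 0
    return R_world, C
```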

I know this must have made for a long read. Thank you very much for your time :)

asked Apr 25 '14 by user3380068

2 Answers

The pipeline you propose is generally correct, except for 3.1.

2.2) Correct. RANSAC picks points at random to estimate the fundamental matrix and is robust to outliers (as long as you have enough valid matches, of course). Homography outliers are NOT necessarily bad matches, so a homography should not be used to filter matches.

3.1) Incorrect: Homography inliers are matches that are perfectly aligned in both views, for example points that exhibit proportional or similar movement between the two views. This means that the higher the number of homography inliers in a view pair, the LESS suitable that pair is as a seed for baseline triangulation. The camera matrices of two such views, derived from a Fundamental matrix estimated with RANSAC, will most likely come out inaccurate, and the reconstruction will never pick up. What you want to do instead is start with the view pair that has the LOWEST percentage of homography inliers while still having a high number of matches. Unfortunately, the image pairs with the highest number of matches usually also have the highest number of homography inliers, because those pairs tend to contain very little camera movement.

3.4) What I do is try the triangulation using all four possible camera matrix ambiguities, [R1|t], [R1|-t], [R2|t], [R2|-t], and keep the one that places the triangulated points in front of both cameras.
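A plain-NumPy sketch of that idea (the decomposition, linear triangulation, and selection logic are the standard textbook constructions, not code from this answer):

```python
import numpy as np

def skew(v):
    """Cross-product matrix, so that skew(a) @ b == np.cross(a, b)."""
    return np.array([[0., -v[2], v[1]], [v[2], 0., -v[0]], [-v[1], v[0], 0.]])

def decompose_essential(E):
    """Return the four (R, t) candidates from an essential matrix."""
    U, _, Vt = np.linalg.svd(E)
    W = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])
    R1, R2 = U @ W @ Vt, U @ W.T @ Vt
    if np.linalg.det(R1) < 0: R1 = -R1   # fix improper rotations
    if np.linalg.det(R2) < 0: R2 = -R2
    t = U[:, 2]                          # translation up to scale and sign
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]

def triangulate(P0, P1, x0, x1):
    """Linear (DLT) triangulation of one normalized correspondence."""
    A = np.stack([x0[0] * P0[2] - P0[0], x0[1] * P0[2] - P0[1],
                  x1[0] * P1[2] - P1[0], x1[1] * P1[2] - P1[1]])
    X = np.linalg.svd(A)[2][-1]
    return X[:3] / X[3]

def pick_pose(E, x0s, x1s):
    """Keep the candidate placing the most points in front of both cameras."""
    P0 = np.hstack([np.eye(3), np.zeros((3, 1))])
    best, best_count = None, -1
    for R, t in decompose_essential(E):
        P1 = np.hstack([R, t.reshape(3, 1)])
        count = 0
        for x0, x1 in zip(x0s, x1s):
            X = triangulate(P0, P1, x0, x1)
            if X[2] > 0 and (R @ X + t)[2] > 0:   # cheirality in both views
                count += 1
        if count > best_count:
            best, best_count = (R, t), count
    return best
```

So triangulating and checking the sign of the depths is exactly the "in front of the camera" test the asker mentions; it works out to the same thing.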

8) Yes

answered Nov 02 '22 by Francois Zard


I can recommend the code for this chapter: https://github.com/godenlove007/master-opencv-book/tree/master/Chapter4_StructureFromMotion

In order to build it, you will need the SSBA and PCL libraries as prerequisites. SSBA is quite simple to build, but PCL can be tricky if you are planning to use Visual Studio 2013; in that case, you have to build PCL's prerequisites from source, which will take some time.

Once you build the project, you can see how the author did it and compare it with your ideas.

answered Nov 02 '22 by karttinen