I'm currently trying to implement an alternate method to webcam-based AR using an external tracking system. I have everything in my environment configured save for the extrinsic calibration. I decided to use cv::solvePnP() as it supposedly does pretty much exactly I want, but after two weeks I am pulling my hair out trying to get it to work. A diagram below shows my configuration. c1 is my camera, c2 is the optical tracker I'm using, M is the tracked marker attached to the camera, and ch is the checkerboard.

As it stands I pass in my image pixel coordinates acquired with cv::findChessboardCorners(). The world points are acquired with reference to the tracked marker M affixed to the camera c1 (The extrinsic is thus the transform from this marker's frame to the camera origin).  I have tested this with data sets up to 50 points to mitigate the possibility of local minima, but for now I'm testing with only four 2D/3D point pairs. The resulting extrinsic I get from the rvec and tvec returned from cv::solvePnP() is massively off in terms of both translation and rotation relative to both a ground truth extrinsic I manually created as well as a basic visual analysis (The translation implies a 1100mm distance while the camera is at most 10mm away). 
Initially I thought the issue was that I had some ambiguities in how my board pose was determined, but now I'm fairly certain that's not the case. The math seems pretty straightforward and after all my work on setting the system up, getting caught up on what is essentially a one-liner is a huge frustration. I'm honestly running out of options, so if anyone can help I would be hugely in your debt. My test code is posted below and is the same as my implementation minus some rendering calls. The ground truth extrinsic I have that works with my program is as follows (Basically a pure rotation around one axis and a small translation):
1     0     0     29
0   .77  -.64   32.06
0   .64   .77  -39.86
0     0     0     1
Thanks!
#include <opencv2\opencv.hpp>
#include <opencv2\highgui\highgui.hpp>
int main()
{
    int imageSize     = 4;
    int markupsSize   = 4;
    std::vector<cv::Point2d> imagePoints;
    std::vector<cv::Point3d> markupsPoints;
    double tempImage[3], tempMarkups[3]; // Temp variables or iterative data construction  
    cv::Mat CamMat = (cv::Mat_<double>(3,3) <<  (566.07469648019332), 0, 318.20416967732666, 
        0,  (565.68051204299513), -254.95231997403764, 0,  0, 1);  
    cv::Mat DistMat = (cv::Mat_<double>(5,1) <<  -1.1310542849120900e-001, 4.5797249991542077e-001,
    7.8356355644908070e-003, 3.7617039978623504e-003, -1.2734302146228518e+000);
    cv::Mat rvec = cv::Mat::zeros(3,1, cv::DataType<double>::type);
    cv::Mat tvec = cv::Mat::zeros(3,1,cv::DataType<double>::type);
    cv::Mat R;
    cv::Mat extrinsic = cv::Mat::eye(4, 4, CV_64F);
    // Escape if markup lists aren't equally sized
    if(imageSize != markupsSize)
    {
    //TODO: Replace with try, throw error code, and catch in qSlicerLocalizationModuleWidget
    return 0;
    }
    // Four principal chessboard corners only
    imagePoints.push_back(cv::Point2d(368.906, 248.123));
    imagePoints.push_back(cv::Point2d(156.583, 252.414));
    imagePoints.push_back(cv::Point2d(364.808, 132.384));
    imagePoints.push_back(cv::Point2d(156.692, 128.289));
    markupsPoints.push_back(cv::Point3d(495.115, 39.106, 93.79));  
    markupsPoints.push_back(cv::Point3d(463.143, -86.286, -51.178));  
    markupsPoints.push_back(cv::Point3d(500.236, 121.988, 24.056));  
    markupsPoints.push_back(cv::Point3d(471.276, -3.23, -127.809));    
    // Larger data set
    /*imagePoints.push_back(cv::Point2d(482.066, 233.778));
    imagePoints.push_back(cv::Point2d(448.024, 232.038));
    imagePoints.push_back(cv::Point2d(413.895, 230.785));
    imagePoints.push_back(cv::Point2d(380.653, 229.242 ));
    imagePoints.push_back(cv::Point2d(347.983, 227.785));
    imagePoints.push_back(cv::Point2d(316.103, 225.977));
    imagePoints.push_back(cv::Point2d(284.02, 224.905));
    imagePoints.push_back(cv::Point2d(252.929, 223.611));
    imagePoints.push_back(cv::Point2d(483.41, 200.527));
    imagePoints.push_back(cv::Point2d(449.456, 199.406));
    imagePoints.push_back(cv::Point2d(415.843, 197.849));
    imagePoints.push_back(cv::Point2d(382.59, 196.763));
    imagePoints.push_back(cv::Point2d(350.094, 195.616));
    imagePoints.push_back(cv::Point2d(317.922, 194.027));
    imagePoints.push_back(cv::Point2d(286.922, 192.814));
    imagePoints.push_back(cv::Point2d(256.006, 192.022));
    imagePoints.push_back(cv::Point2d(484.292, 167.816));
    imagePoints.push_back(cv::Point2d(450.678, 166.982));
    imagePoints.push_back(cv::Point2d(417.377, 165.961));
    markupsPoints.push_back(cv::Point3d(457.132, 59.822, 89.247));  
    markupsPoints.push_back(cv::Point3d(451.634, 42.015, 69.719));  
    markupsPoints.push_back(cv::Point3d(447.06, 22.927, 48.635));  
    markupsPoints.push_back(cv::Point3d(442.424, 4.454, 28.659));  
    markupsPoints.push_back(cv::Point3d(437.621, -14.395, 7.495));  
    markupsPoints.push_back(cv::Point3d(433.386, -33.034, -12.009));  
    markupsPoints.push_back(cv::Point3d(429.227, -51.001, -32.269));  
    markupsPoints.push_back(cv::Point3d(424.291, -70.266, -52.667));  
    markupsPoints.push_back(cv::Point3d(460.300, 79.769, 69.948));  
    markupsPoints.push_back(cv::Point3d(455.020, 61.379, 49.306));  
    markupsPoints.push_back(cv::Point3d(450.501, 43.288, 30.250));  
    markupsPoints.push_back(cv::Point3d(446.062, 24.572, 8.713));  
    markupsPoints.push_back(cv::Point3d(441.346, 5.823, -10.997));  
    markupsPoints.push_back(cv::Point3d(436.670, -13.135, -31.428));  
    markupsPoints.push_back(cv::Point3d(432.367, -31.428, -51.785));  
    markupsPoints.push_back(cv::Point3d(427.745, -50.016, -72.519));  
    markupsPoints.push_back(cv::Point3d(464.824, 101.129, 52.251));  
    markupsPoints.push_back(cv::Point3d(458.628, 81.864, 30.810));  
    markupsPoints.push_back(cv::Point3d(454.120, 63.546, 10.458));  */
    // Calculate camera pose
    cv::solvePnP(cv::Mat(markupsPoints), cv::Mat(imagePoints), CamMat, DistMat, rvec, tvec);
    cv::Rodrigues(rvec, R);
    // Invert results of Rodrigues by transposing rotation matrix and calculating inverted tvec
    R = R.t();  
    tvec = -R * tvec; // translation of inverse
    std::cout << "Tvec = " << std::endl << tvec << std::endl;
    // Copy R and tvec into the extrinsic matrix
    extrinsic( cv::Range(0,3), cv::Range(0,3) ) = R * 1; 
    extrinsic( cv::Range(0,3), cv::Range(3,4) ) = tvec * 1; 
    // Fill last row of homogeneous transform (0,0,0,1)
    double *p = extrinsic.ptr<double>(3);
    p[0] = p[1] = p[2] = 0; p[3] = 1;
    std::cout << "Extrinsic = " << std::endl << extrinsic << std::endl << std::endl;
    std::cout << "Extrinsic (inv) = " << std::endl << extrinsic.inv() << std::endl;
    std::cin >> tempImage[0];
    return 0;
}
EDIT 1: I tried normalizing the pixel values using Chi Xu's method (xn = (x-cx)/f, yn = (y-cy)/f). No luck :(
EDIT 2: As it seems that almost everyone who uses solvePnP uses a method where they mark the checkerboard corners as the vectors for their world frame and origin, I'm going to try registering my tracker to the checkerboard. If this works as expected, the first coordinate I mark will be approximately <0,0>. The resulting transform from solvePnP will then be multiplied by the inverse of the world-to-camera-marker transformation, thus resulting in (hopefully) the extrinsic matrix.
EDIT 3: I performed the steps described in step 2. After establishing a coordinate system on the checkerboard, I calculated the transform cTw from the checkerboard space to the world space and verified it in my rendering environment. I then calculated the extrinsic mTc representing the transform from the camera space to the checkerboard space. The camera marker transform wTr was acquired from the tracking system. Finally, I took all of the transforms and multiplied them as follows to move my transformation all the way from camera origin to camera marker:
mTc*cTw*wTr
Lo and behold it gave the the exact same result. I'm dying here for any sign of what I'm doing wrong. If anyone has any clue, I beg of you to help.
EDIT 4: Today I realized that I had configured my checkerboard axes in such a way that they were in a left-handed coordinate system. Since OpenCV operates using standard right-hand form, I decided to retry the tests with my checkerboard frame axes configured in a right-hand fashion. While the results were about the same, I did notice that the world-to-checkerboard transform was now a non-standard transform format, i.e. the 3x3 rotation matrix was no longer roughly symmetrical around the diagonal as it should be. This indicates that my tracking system is not using a right-hand coordinate system (Would have been great if they had documented that. Or anything, for that matter). While I'm not sure how to resolve this, I am certain that it is part of the issue. I'll keep hammering at it and hope someone here knows what to do. I've also added a diagram provided to me on the OpenCV boards by Eduardo. Thanks Eduardo!
So I figured out the issue. Unsurprisingly, it was in the implementation, not the theory. When the project was initially designed, we utilized a webcam-based tracker. This tracker had the z-axis coming out of the marker plane. When we moved to an optical tracker, the code was ported mostly as-is. Unfortunately for us (RIP 2 months of my life), we never checked if the z-axis still came out of the marker plane (It didn't). So the render pipeline was assigned the wrong view up and view out vectors to the scene camera. It's mostly working now, except that the translations are off for some reason. Entirely different problem though!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With