Friday, September 6, 2024

Learn Monocular SLAM with OpenCV

Introduction

To understand how AR works with a monocular camera, I spent some free time researching SLAM methodologies and exploring popular repositories such as ORB_SLAM2/3, VINS-Mono, Twitchslam, and others. The purpose of SLAM is to estimate the camera’s position and orientation in the world frame while simultaneously building a point cloud map of the scene. To gain a deeper understanding, I decided to implement it myself using the building blocks available in OpenCV.

For the full implementation, please visit the following repository: https://github.com/longpth/RoLabsSlam.

Part 1: Initialization

Step 1: Camera Pose Estimation by Feature Matching Between 2 Consecutive Images

To estimate the camera pose in the world frame, we first need to estimate the relative camera transformation between 2 consecutive camera frames. This relation is established by matching keypoints between the 2 consecutive images.

C++ Code:

void Frame::detectAndCompute2(const cv::Mat& image) {
    // Parameters for goodFeaturesToTrack
    int maxCorners = 2000;
    double qualityLevel = 0.01;
    double minDistance = 10.0;
    int blockSize = 3;
    bool useHarrisDetector = false;
    double k = 0.04;

    // Detect good features to track in the image
    std::vector<cv::Point2f> goodFeatures;
    cv::goodFeaturesToTrack(image, goodFeatures, maxCorners, qualityLevel, minDistance, 
        cv::Mat(), blockSize, useHarrisDetector, k);

    // Convert the detected corners to keypoints
    std::vector<cv::KeyPoint> goodKeypoints;
    cv::KeyPoint::convert(goodFeatures, goodKeypoints);

    // Compute ORB descriptors for the good keypoints
    cv::Mat goodFeaturesDescriptors;
    _orb->compute(image, goodKeypoints, goodFeaturesDescriptors);

    // Store the detected keypoints and descriptors in the class members
    _keypoints = goodKeypoints;
    _descriptors = goodFeaturesDescriptors;
}
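
The function above only detects keypoints and computes their ORB descriptors. To obtain the matches between two consecutive frames, the descriptors can be matched, for example, with a brute-force Hamming matcher and Lowe's ratio test. The sketch below is only an illustration of that step; the helper name MatchFrames and the 0.75 ratio threshold are my own choices, not taken from the repository.

C++ Code:

#include <opencv2/opencv.hpp>

// Minimal sketch: match ORB descriptors of two frames with a brute-force
// Hamming matcher and keep only matches that pass Lowe's ratio test.
std::vector<cv::DMatch> MatchFrames(const cv::Mat& descriptors1, const cv::Mat& descriptors2)
{
    cv::BFMatcher matcher(cv::NORM_HAMMING);
    std::vector<std::vector<cv::DMatch>> knnMatches;
    matcher.knnMatch(descriptors1, descriptors2, knnMatches, 2);

    std::vector<cv::DMatch> goodMatches;
    for (const auto& m : knnMatches)
    {
        // Keep a match only if it is clearly better than the second-best candidate
        if (m.size() == 2 && m[0].distance < 0.75f * m[1].distance)
        {
            goodMatches.push_back(m[0]); // queryIdx -> frame1, trainIdx -> frame2
        }
    }
    return goodMatches;
}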
        

Images Taken from KITTI Dataset

The images used in this implementation are taken from the KITTI Vision Benchmark Suite (cvlibs.net).

Essential Matrix Calculation

After we have matched points between the 2 images, we can compute the Essential matrix, which relates corresponding points in two views of the same scene. It is defined as:

$$ \mathbf{p_2}^T \cdot E \cdot \mathbf{p_1} = 0 $$

Where:

  • p1 is the point in the first image (in normalized coordinates).
  • p2 is the point in the second image (in normalized coordinates).
  • E is the essential matrix (a 3x3 matrix).

We can use OpenCV to calculate the Essential matrix as follows:

C++ Code:

cv::Mat EstimateEssentialMatrix(const Frame& frame1, const Frame& frame2, const std::vector<cv::DMatch>& good_matches, const cv::Mat& cameraMatrix) {
    // Extract the matched keypoints
    std::vector<cv::Point2f> points1, points2;
    for (const auto& match : good_matches) {
        points1.push_back(frame1.KeyPoints()[match.queryIdx].pt);
        points2.push_back(frame2.KeyPoints()[match.trainIdx].pt);
    }

    // Compute the Essential Matrix using RANSAC to handle outliers
    cv::Mat essential_matrix = cv::findEssentialMat(points1, points2, cameraMatrix, cv::RANSAC, 0.999, 1.0);
    return essential_matrix;
}
        

Once we have the Essential Matrix, we can extract the relative rotation (R) and translation (t) between the two camera views with cv::recoverPose().
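
A minimal sketch of this step, assuming the same points1/points2 and cameraMatrix as above (variable names are illustrative):

C++ Code:

// Minimal sketch: recover the relative pose from the Essential matrix.
// essential_matrix, points1, points2 and cameraMatrix are assumed to be the
// same as in EstimateEssentialMatrix above.
cv::Mat R, t;
cv::Mat inlierMask;
int inliers = cv::recoverPose(essential_matrix, points1, points2, cameraMatrix, R, t, inlierMask);
// R (3x3) and t (3x1, unit length) transform points from the first camera
// frame into the second one; inlierMask flags the matches that pass the
// cheirality check (triangulated points in front of both cameras).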

Note that the translation t is only determined up to scale; in other words, its magnitude is arbitrary. For real robotics applications, you will need to determine the actual scale of the scene to get the correct translation. Without the correct scale, the translation vector only provides the direction of movement, not the real-world distance.

For monocular setups, one way to determine the scale is to combine the visual data with additional sensors like an IMU (Inertial Measurement Unit) or wheel odometry. The IMU can provide acceleration and orientation information, while wheel odometry can measure the actual distance traveled by the robot. Using these sources of information together, you can estimate the true scale and calculate the correct translation.

Step 2: Reconstruct 3D Point Cloud After Estimating Camera Pose

With the transformation between the 2 camera views from step 1 (rotation and translation combined), we triangulate the 2 sets of matched 2D points into 3D space. Triangulation involves solving a system of linear equations to find the 3D point that projects to the corresponding 2D points in both images.

The goal is to find a point X in 3D space that, when projected into both camera views using the projection matrices P1 and P2, gives the corresponding 2D points p1 and p2.

The projection matrix P can be written as:

P = K [ R | t ]

Where:

  • K is the intrinsic camera matrix, which includes the camera's internal parameters like focal lengths and principal point.
  • R is the rotation matrix, which describes the orientation of the camera.
  • t is the translation vector, which describes the position of the camera relative to the world coordinate system.
  • [ R | t ] is the extrinsic matrix, which describes how the camera is positioned in the world.
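
For the initialization step, the first camera can be taken as the world origin, so P1 = K [ I | 0 ] and P2 = K [ R | t ] with R and t from cv::recoverPose(). A small sketch of how the two projection matrices can be assembled (assuming all matrices are CV_64F; variable names are illustrative):

C++ Code:

// Minimal sketch: build the two 3x4 projection matrices used for triangulation.
// The first camera is taken as the world origin; R and t come from recoverPose.
cv::Mat P1 = cameraMatrix * cv::Mat::eye(3, 4, CV_64F);   // P1 = K [ I | 0 ]

cv::Mat Rt = cv::Mat::zeros(3, 4, CV_64F);                // extrinsics [ R | t ]
R.copyTo(Rt(cv::Rect(0, 0, 3, 3)));
t.copyTo(Rt(cv::Rect(3, 0, 1, 3)));
cv::Mat P2 = cameraMatrix * Rt;                           // P2 = K [ R | t ]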

We can use the cv::sfm::triangulatePoints(InputArrayOfArrays points2d, InputArrayOfArrays projection_matrices, OutputArray points3d) API from OpenCV's sfm module, or we can re-implement it ourselves as follows:

C++ Code:

  std::vector<cv::Point3d> MyTriangulatePoints(
      const cv::Mat& P1, const cv::Mat& P2,
      const std::vector<cv::Point2f>& points1,
      const std::vector<cv::Point2f>& points2,
      std::vector<bool>& mask,
      int& goodParralaxCnt)
  {
      assert(points1.size() == points2.size()); // Size of both point vectors must be equal

      std::vector<cv::Point3d> points3D;

      for (size_t i = 0; i < points1.size(); i++)
      {
          // Construct matrix A
          cv::Mat A(4, 4, CV_64F);
          A.row(0) = points1[i].x * P1.row(2) - P1.row(0);
          A.row(1) = points1[i].y * P1.row(2) - P1.row(1);
          A.row(2) = points2[i].x * P2.row(2) - P2.row(0);
          A.row(3) = points2[i].y * P2.row(2) - P2.row(1);

          // Perform SVD decomposition
          cv::Mat u, w, vt;
          cv::SVD::compute(A, w, u, vt);

          // The 3D point in homogeneous coordinates is the last column of V^T (or row of V in OpenCV)
          cv::Mat X = vt.row(3).t();

          // Convert homogeneous coordinates to 3D
          X /= X.at<double>(3);  // Normalize by W to get (X, Y, Z, 1)

          cv::Point3d pt3d = cv::Point3d(X.at<double>(0), X.at<double>(1), X.at<double>(2));

          points3D.emplace_back(X.at<double>(0), X.at<double>(1), X.at<double>(2));

          if (CalculateCosParallax(P1, P2, pt3d) < 0.9998)
          {
              mask.push_back(true);
              goodParralaxCnt++;
          }
          else
          {
              mask.push_back(false);
          }
      }

      return points3D;
  }
  

In MyTriangulatePoints above, I followed OpenCV and ORB-SLAM2 and discard points with small parallax (cosine of the ray angle close to 1), since they give unreliable depth estimates.
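
The helper CalculateCosParallax is not shown here. A possible sketch, assuming P1 and P2 are the full 3x4 projection matrices K [ R | t ], is to recover the two camera centers from the projection matrices and measure the cosine of the angle between the rays from each center to the triangulated point (this is my own reconstruction, not necessarily the repository's exact code):

C++ Code:

#include <cmath>
#include <opencv2/opencv.hpp>

// Possible sketch of CalculateCosParallax. For a projection matrix P = [M | p4],
// the camera center C satisfies P*[C;1] = 0, i.e. C = -M^-1 * p4. The parallax
// is the angle between the rays C1->X and C2->X.
double CalculateCosParallax(const cv::Mat& P1, const cv::Mat& P2, const cv::Point3d& pt3d)
{
    auto cameraCenter = [](const cv::Mat& P) {
        cv::Mat M = P.colRange(0, 3);            // left 3x3 block
        cv::Mat C = -M.inv() * P.col(3);         // camera center in world coordinates
        return cv::Point3d(C.at<double>(0), C.at<double>(1), C.at<double>(2));
    };

    cv::Point3d ray1 = pt3d - cameraCenter(P1);
    cv::Point3d ray2 = pt3d - cameraCenter(P2);

    double n1 = std::sqrt(ray1.dot(ray1));
    double n2 = std::sqrt(ray2.dot(ray2));

    // Values close to 1.0 mean nearly parallel rays (small parallax, poor depth).
    return ray1.dot(ray2) / (n1 * n2);
}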

Part 2: Tracking

After the initialization process, we already have the initial point cloud and the camera transformation from the first camera view to the second one.

In the tracking process:

Step 1: Apply Motion Model

We apply a constant velocity motion model, currentPose = velocity * previousPose, to obtain an initial pose estimate for the current frame.
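
With poses stored as 4x4 homogeneous world-to-camera transforms, this prediction is a single matrix product. A tiny sketch (variable names and the SetPose setter are assumptions, not the repository's exact API):

C++ Code:

// Minimal sketch of the constant velocity model. velocity, previousPose and
// predictedPose are 4x4 CV_64F homogeneous transforms.
cv::Mat predictedPose = velocity * previousPose;   // initial guess for the current frame
currentFrame.SetPose(predictedPose);               // hypothetical setter on the Frame class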

Step 2: Track 3D Points

We try to track the existing 3D map points in the current frame by projecting them into the image plane with the predicted pose and looking for matching keypoints near each projection.
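
One way to implement the projection step is with cv::projectPoints; the sketch below assumes the 4x4 predicted pose from Step 1 and illustrative names for the map points and intrinsics:

C++ Code:

// Minimal sketch: project the current 3D map points (std::vector<cv::Point3d>)
// into the image with the predicted pose. mapPoints, predictedPose and
// cameraMatrix are assumed names.
cv::Mat R_cw = predictedPose(cv::Rect(0, 0, 3, 3));           // rotation part of the pose
cv::Mat t_cw = predictedPose(cv::Rect(3, 0, 1, 3)).clone();   // translation part

cv::Mat rvec;
cv::Rodrigues(R_cw, rvec);                                    // rotation matrix -> rotation vector

std::vector<cv::Point2d> projected;
cv::projectPoints(mapPoints, rvec, t_cw, cameraMatrix, cv::noArray(), projected);
// Each projected[i] is then matched against current-frame keypoints that lie
// within a small search window around it.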

Step 3: Pose Optimization

With the set of matched 2D keypoints and their corresponding 3D map points, we perform pose optimization for the current frame using the g2o library. The current camera pose is then updated with the optimized result.
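
A condensed sketch of this motion-only optimization, in the style of ORB-SLAM2's PoseOptimization, is shown below. It assumes the ORB-SLAM2-flavoured g2o types (VertexSE3Expmap, EdgeSE3ProjectXYZOnlyPose); the raw-pointer solver construction matches ORB-SLAM2's bundled g2o, while recent upstream g2o expects std::unique_ptr. The variable names (Rcw, tcw, matchedKeypoints, mapPoints, fx/fy/cx/cy) are assumptions.

C++ Code:

// Condensed sketch of motion-only pose optimization with g2o (ORB-SLAM2 style).
g2o::SparseOptimizer optimizer;
auto* linearSolver = new g2o::LinearSolverDense<g2o::BlockSolver_6_3::PoseMatrixType>();
auto* blockSolver  = new g2o::BlockSolver_6_3(linearSolver);   // recent g2o needs std::unique_ptr here
optimizer.setAlgorithm(new g2o::OptimizationAlgorithmLevenberg(blockSolver));

// Single vertex: the current camera pose (world-to-camera) as an SE3 element.
auto* vSE3 = new g2o::VertexSE3Expmap();
vSE3->setId(0);
vSE3->setEstimate(g2o::SE3Quat(Rcw, tcw));   // Eigen rotation matrix and translation vector
optimizer.addVertex(vSE3);

// One unary edge per 2D keypoint <-> 3D map point correspondence.
for (size_t i = 0; i < matchedKeypoints.size(); ++i)
{
    auto* e = new g2o::EdgeSE3ProjectXYZOnlyPose();
    e->setVertex(0, vSE3);
    e->setMeasurement(Eigen::Vector2d(matchedKeypoints[i].pt.x, matchedKeypoints[i].pt.y));
    e->setInformation(Eigen::Matrix2d::Identity());
    e->setRobustKernel(new g2o::RobustKernelHuber());
    e->fx = fx; e->fy = fy; e->cx = cx; e->cy = cy;                              // camera intrinsics
    e->Xw = Eigen::Vector3d(mapPoints[i].x, mapPoints[i].y, mapPoints[i].z);     // fixed 3D point
    optimizer.addEdge(e);
}

optimizer.initializeOptimization();
optimizer.optimize(10);

g2o::SE3Quat optimizedPose = vSE3->estimate();   // updated world-to-camera pose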

Step 4: Create New 3D Map Points

We check whether new 3D map points are needed. If so, we create them from the current frame keypoints that were not matched in the projection search: optical flow is used to locate the corresponding 2D points in the previous frame, and the pairs are then triangulated. For points that optical flow fails to track, we fall back to descriptor matching with BFMatcher.
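
A minimal sketch of the optical flow step with cv::calcOpticalFlowPyrLK (grayscale images and the variable names are assumptions):

C++ Code:

// Minimal sketch: track unmatched keypoints of the current frame back into the
// previous frame with pyramidal Lucas-Kanade optical flow.
std::vector<cv::Point2f> currentPts;     // unmatched keypoints of the current frame
std::vector<cv::Point2f> previousPts;    // their tracked positions in the previous frame
std::vector<uchar> status;
std::vector<float> err;

cv::calcOpticalFlowPyrLK(currentImage, previousImage, currentPts, previousPts,
                         status, err, cv::Size(21, 21), 3);

// Keep only the successfully tracked pairs; these are triangulated into new
// map points. The remaining points are matched with a BFMatcher instead.
std::vector<cv::Point2f> goodCurrent, goodPrevious;
for (size_t i = 0; i < status.size(); ++i)
{
    if (status[i])
    {
        goodCurrent.push_back(currentPts[i]);
        goodPrevious.push_back(previousPts[i]);
    }
}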

Step 5: Request Bundle Adjustment

We perform bundle adjustment using the g2o library, jointly refining the camera poses and the 3D map points.

Step 6: Calculate Camera Velocity

Finally, we calculate the camera velocity as velocity = currentPose * inverse(previousPose), which is used by the motion model to predict the next camera pose.
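
With 4x4 homogeneous poses this is a one-liner (a sketch with the same assumed variable names as in Step 1):

C++ Code:

// Minimal sketch: update the velocity used by the constant velocity model.
// currentPose and previousPose are 4x4 CV_64F world-to-camera transforms.
cv::Mat velocity = currentPose * previousPose.inv();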


Saturday, July 27, 2024

Simple Object Tracking Camera with Android Phone

In this project, I would like to show you how to build a simple object tracking camera with an Android phone.

Things used in this project

Hardware components

  1. Arduino Uno
  2. Bluetooth 4.0 UART CC2541 HM-10
  3. RC Servo x 2
  4. Battery
  5. Android Phone

Software apps and online services

  • Arduino IDE
  • Android Studio

Story

The software component is an Android application written in Java and developed with Android Studio. The app uses the OpenCV library to implement object tracking, allowing the camera to follow a designated object in real time. The trackers used in the project are TrackerMedianFlow, TrackerCSRT, TrackerKCF, TrackerMOSSE, TrackerTLD, and TrackerMIL.
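
The per-frame flow is the same regardless of the tracker type: create a tracker, initialize it with a bounding box, then update it on every new frame. The minimal C++ sketch below shows that flow with OpenCV's tracking module (the actual project does the equivalent in Java through the OpenCV Android SDK; tracker availability depends on the OpenCV version and the contrib modules):

C++ Code:

#include <opencv2/opencv.hpp>
#include <opencv2/tracking.hpp>

// Minimal sketch of the generic tracking loop (OpenCV 4.x API; the project itself is Java).
int main()
{
    cv::VideoCapture cap(0);                         // any video source
    cv::Mat frame;
    cap >> frame;

    cv::Rect bbox = cv::selectROI("select", frame);  // user picks the target object
    cv::Ptr<cv::Tracker> tracker = cv::TrackerCSRT::create();
    tracker->init(frame, bbox);

    while (cap.read(frame))
    {
        bool ok = tracker->update(frame, bbox);      // bbox is updated in place
        if (ok)
            cv::rectangle(frame, bbox, cv::Scalar(0, 255, 0), 2);

        cv::imshow("tracking", frame);
        if (cv::waitKey(1) == 27) break;             // ESC to quit
    }
    return 0;
}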

Source code

https://github.com/longpth/Android-Object-Tracking-OpenCV/tree/master

Demonstration
