Feature Detection

Feature detection is a well-established computer vision technique that underpins robotic applications such as simultaneous localization and mapping (SLAM), structure-from-motion (SfM), motion-based image segmentation, object recognition and tracking, stereo calibration and navigation. A feature is an image region with specific properties and/or structures, such as an edge, a point or a blob; these features are identified in one image and matched with corresponding features from other images for the applications mentioned above.

An edge feature is a good indicator of boundaries and occlusions. Such features are matched based on their orientation and appearance profile, whereas keypoint and corner features are localised features defined by their surrounding pixels.

Feature detectors extract interest points based on their properties. A feature should have a well-defined position, be stable under local and global perturbations such as brightness variation, and allow efficient detection and reliable computation with a high degree of repeatability. Given these properties, we can detect features based on image brightness by computing image derivatives, or extract boundaries through curvature analysis or edge detection.
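As an illustration of brightness-based detection, here is a minimal sketch (assuming OpenCV and NumPy; "image.png" is a placeholder path) that computes the image derivatives such cues are built on:

```python
import cv2
import numpy as np

# Load a grayscale image; the path is a placeholder.
img = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

# First-order derivatives (Sobel) capture brightness changes along x and y.
Ix = cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=3)
Iy = cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=3)

# Gradient magnitude highlights edges; strong responses in both directions
# hint at corner-like interest points.
magnitude = np.sqrt(Ix ** 2 + Iy ** 2)
orientation = np.arctan2(Iy, Ix)  # edge direction in radians
```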

The scale-invariant feature transform (SIFT) finds a set of distinctive keypoints, defines a region around each keypoint, extracts and normalises the region content, computes a local descriptor from the normalised region and matches the local descriptors.

Objects around us bring meaning to a context only at a certain scale. A small plant is called a shrub and a large one a tree; other differences aside, this multi-scale nature of objects is common in the environment. Scale space replicates this multi-scale behaviour on digital images by convolving the input image with Gaussian kernels of different scales. In SIFT, the scale space is divided into octaves and scales, designed based on the size of the input image. The image size in a given octave is half that of the previous octave, and since SIFT starts from an input image doubled in size, an input of 100 x 100 pixels yields octaves of size 200 x 200, 100 x 100, 50 x 50, 25 x 25 and so on. While the number of octaves depends on the input image size, four octaves are generally considered adequate.
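A minimal sketch of such a scale space, assuming OpenCV and commonly quoted values (four octaves, base sigma of 1.6); the helper name and parameters are illustrative rather than SIFT's exact implementation:

```python
import cv2

def gaussian_scale_space(gray, n_octaves=4, scales_per_octave=5, sigma0=1.6):
    """Build a SIFT-style scale space: each octave holds progressively
    blurred copies of the image; the next octave starts at half the size."""
    octaves = []
    # SIFT typically starts from an image doubled in size.
    base = cv2.resize(gray, None, fx=2, fy=2, interpolation=cv2.INTER_LINEAR)
    k = 2 ** (1.0 / (scales_per_octave - 1))  # blur ratio between scales
    for _ in range(n_octaves):
        octave = []
        for s in range(scales_per_octave):
            sigma = sigma0 * (k ** s)
            octave.append(cv2.GaussianBlur(base, (0, 0), sigmaX=sigma))
        octaves.append(octave)
        # Halve the image size for the next octave.
        base = cv2.resize(base, (base.shape[1] // 2, base.shape[0] // 2),
                          interpolation=cv2.INTER_NEAREST)
    return octaves
```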

Within each octave, images are progressively blurred using a Gaussian blur operator, which adds blur at different scale values. The scale space thus has multiple octaves, and each octave has multiple scale values, creating a continuous scale variation across images; for more detail, refer to [1]. SIFT then generates a set of images by taking the difference of Gaussians (DoG) between adjacent scale levels of each octave. This stack of images is used as the search space for finding keypoints. A pixel is compared with its 8 neighbours in the same image as well as 9 pixels on the previous scale and 9 pixels on the next scale. If the pixel is a local extremum, it is a potential keypoint best represented at that specific scale. These keypoints are refined by peak-threshold tuning, which eliminates keypoints whose contrast (DoG response) is below a specific value (default: 0.03). To reject keypoints that lie on edges, SIFT computes the principal curvature using the Hessian matrix during edge-threshold tuning (default: 10).
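To make the neighbourhood comparison concrete, here is a minimal sketch of extremum detection within one DoG octave. The helper is hypothetical (not OpenCV's implementation), and the DoG images are assumed to be normalised to the 0-1 range so the 0.03 peak threshold applies:

```python
import numpy as np

def dog_extrema(dog_octave, contrast_threshold=0.03):
    """Find candidate keypoints in a list of 2D DoG images (one octave,
    ordered by scale). A pixel is a candidate if it is an extremum among
    its 8 spatial neighbours plus the 9 pixels in the scales above and
    below, and its |DoG| response exceeds the peak (contrast) threshold."""
    keypoints = []
    for s in range(1, len(dog_octave) - 1):
        prev_img, img, next_img = dog_octave[s - 1], dog_octave[s], dog_octave[s + 1]
        for y in range(1, img.shape[0] - 1):
            for x in range(1, img.shape[1] - 1):
                value = img[y, x]
                if abs(value) < contrast_threshold:
                    continue  # peak-threshold rejection of low-contrast points
                cube = np.stack([prev_img[y-1:y+2, x-1:x+2],
                                 img[y-1:y+2, x-1:x+2],
                                 next_img[y-1:y+2, x-1:x+2]])
                if value >= cube.max() or value <= cube.min():
                    keypoints.append((x, y, s))  # location plus scale index
    return keypoints
```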

After peak-threshold and edge-threshold tuning, we are left with refined keypoints that are stable features. At this stage SIFT knows the location (x, y) and the scale (σ) of these stable features. Orientation assignment makes the SIFT algorithm not only scale invariant but also rotation invariant. SIFT assigns an orientation by calculating the gradient magnitude and gradient direction in the neighbourhood of each keypoint at its scale. The algorithm builds an orientation histogram with 36 bins covering 360 degrees; for example, a point with a gradient direction of 12.148 degrees falls into the 10-19-degree bin and contributes a value proportional to its gradient magnitude. Orientations are assigned to histogram peaks that reach at least 80% of the highest peak, which can produce keypoints with the same location and scale but different orientations, yielding all possible SIFT features of an image.
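A minimal sketch of that orientation histogram follows; the helper and its inputs are illustrative, and the full SIFT implementation additionally weights samples by a Gaussian window and interpolates the peak position:

```python
import numpy as np

def dominant_orientations(magnitudes, directions_deg, n_bins=36):
    """Given gradient magnitudes and directions (degrees) sampled from one
    keypoint's neighbourhood, return the dominant orientations: histogram
    peaks reaching at least 80% of the highest bin."""
    bin_width = 360 // n_bins  # 10 degrees per bin for 36 bins
    hist = np.zeros(n_bins)
    for m, d in zip(magnitudes.ravel(), directions_deg.ravel()):
        bin_idx = int(d % 360) // bin_width   # e.g. 12.148 deg -> 10-19 deg bin
        hist[bin_idx] += m                    # weight by gradient magnitude
    peak = hist.max()
    return [b * bin_width for b in range(n_bins) if hist[b] >= 0.8 * peak]
```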

The following image shows SIFT features extracted from an image at default settings.
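For reference, a minimal OpenCV sketch that extracts and draws SIFT features at default settings; the file names are placeholders, and OpenCV 4.4 or later is assumed (where SIFT lives in the main module):

```python
import cv2

# Load a grayscale image; the path is a placeholder.
img = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()                        # default peak/edge thresholds
keypoints, descriptors = sift.detectAndCompute(img, None)

# Draw keypoints with their scale and orientation for inspection.
vis = cv2.drawKeypoints(img, keypoints, None,
                        flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv2.imwrite("sift_keypoints.png", vis)
```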

Considering other feature detectors, speeded-up robust features (SURF) is built on the same principles as SIFT and also detects blobs in an image. However, the details of each step of detection, description and matching differ between SIFT and SURF. While SIFT uses the DoG to find interest points, SURF uses square-shaped (box) filters as an approximation of Gaussian smoothing.

Oriented FAST and rotated BRIEF (ORB) is likewise similar in spirit to SIFT and is built on the features from accelerated segment test (FAST) keypoint detector and the binary robust independent elementary features (BRIEF) descriptor. ORB is reported to perform better than SURF, but FAST keypoints lack an orientation component, which ORB adds; and although ORB uses multi-scale image pyramids, its features are only partially scale invariant. The Harris corner detection algorithm instead computes the spatial derivatives of a grayscale image, constructs a structure tensor (second-moment matrix), calculates the Harris response and performs non-maximum suppression.
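A minimal OpenCV sketch of both detectors; the file name is a placeholder, and the ORB and Harris parameters shown are OpenCV's common defaults rather than tuned values:

```python
import cv2
import numpy as np

# Load a grayscale image; the path is a placeholder.
gray = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE)

# ORB: FAST keypoints on a multi-scale pyramid plus rotated BRIEF descriptors.
orb = cv2.ORB_create(nfeatures=500)
orb_keypoints, orb_descriptors = orb.detectAndCompute(gray, None)

# Harris: response map from the structure tensor of the image derivatives.
harris = cv2.cornerHarris(np.float32(gray), blockSize=2, ksize=3, k=0.04)
corners = np.argwhere(harris > 0.01 * harris.max())  # simple thresholding
```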

While we will not discuss description and matching in depth in this article, it is important to explore these topics to understand how they affect robotic applications.

Description

A description captures the local appearance around each feature point and is, in general, invariant under changes in translation, scale, illumination and in-plane rotation. A descriptor vector encodes this appearance as a numerical fingerprint of the feature. There are two major categories of descriptors: global and local. Global descriptors, which describe a whole image, are not reliable enough for robotics applications, since robots can undergo non-rigid motion and a change in part of the image can cause matching to fail. Local descriptors, by contrast, capture the shape and appearance of local neighbourhoods and are therefore well suited for matching.
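As a small illustration of what these fingerprints look like in practice, this sketch (placeholder file name, OpenCV assumed) prints the fixed-length vectors that two common detectors produce per keypoint:

```python
import cv2

# Load a grayscale image; the path is a placeholder.
img = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE)

_, sift_desc = cv2.SIFT_create().detectAndCompute(img, None)
_, orb_desc = cv2.ORB_create().detectAndCompute(img, None)

print(sift_desc.shape, sift_desc.dtype)  # (N, 128) float32 gradient histograms
print(orb_desc.shape, orb_desc.dtype)    # (M, 32) uint8 binary (BRIEF) strings
```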

Matching

We can identify similar features by comparing descriptors across images. Doing so yields a set of pairs (xa, ya) ↔ (xb, yb) for two images, where (xa, ya) is the location of a feature in one image and (xb, yb) is the location of the same feature in the second image. In robotics, we establish such correspondences across multiple images of the same scene. The performance of matching depends on both the interest points and the image descriptors, so it is important to pick appropriate feature detectors and descriptors for robotics applications.
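A minimal sketch of matching SIFT descriptors across two images with OpenCV's brute-force matcher; the file names are placeholders, and the 0.75 ratio-test threshold is a commonly used choice rather than a fixed part of the method:

```python
import cv2

# Load two views of the same scene; the paths are placeholders.
img_a = cv2.imread("image_a.png", cv2.IMREAD_GRAYSCALE)
img_b = cv2.imread("image_b.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kps_a, desc_a = sift.detectAndCompute(img_a, None)
kps_b, desc_b = sift.detectAndCompute(img_b, None)

# Brute-force matching with a ratio test to discard ambiguous matches.
matcher = cv2.BFMatcher(cv2.NORM_L2)
knn = matcher.knnMatch(desc_a, desc_b, k=2)
good = [pair[0] for pair in knn
        if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance]

# Each match gives a pair (xa, ya) <-> (xb, yb) of corresponding locations.
pairs = [(kps_a[m.queryIdx].pt, kps_b[m.trainIdx].pt) for m in good]
```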

Applications

While we have discussed SIFT feature detection and its relevance to robotics applications in this article, other feature detectors, descriptors and matchers have also shown good performance in robotics. Harris corner algorithms are generally used for applications that require an understanding of edges and corners, such as drone navigation from aerial images or simultaneous localization and mapping around man-made structures. Detectors such as SIFT, SURF, FAST and ORB perform well in medical applications and are used in state-of-the-art reconstruction pipelines. For more reading, refer to [1-8].

References

  1. Lowe, D.G., 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), pp. 91-110.
  2. Bay, H., Tuytelaars, T. and Van Gool, L., 2006. SURF: Speeded up robust features. European Conference on Computer Vision, pp. 404-417.
  3. Rublee, E., Rabaud, V., Konolige, K. and Bradski, G., 2011. ORB: An efficient alternative to SIFT or SURF. International Conference on Computer Vision, pp. 2564-2571.
  4. Mur-Artal, R., Montiel, J.M.M. and Tardos, J.D., 2015. ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5), pp. 1147-1163.
  5. Dollár, P., Appel, R., Belongie, S. and Perona, P., 2014. Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8), pp. 1532-1545.
  6. Leutenegger, S., Chli, M. and Siegwart, R.Y., 2011. BRISK: Binary robust invariant scalable keypoints. International Conference on Computer Vision, pp. 2548-2555.
  7. Calonder, M., Lepetit, V., Strecha, C. and Fua, P., 2010. BRIEF: Binary robust independent elementary features. European Conference on Computer Vision, pp. 778-792.
  8. Rosten, E. and Drummond, T., 2006. Machine learning for high-speed corner detection. European Conference on Computer Vision, pp. 430-443.