# Computer Vision – Feature Detection

Feature detection is the task of finding key points (distinctive pixels) in an image. This raises an immediate question: what is a key point? A key point is a point that is unique within the local area around it and that can be found again and matched to the corresponding point in another image.

## Aperture Problem

The following diagram illustrates which points do and do not make good key points.

The diagram above shows the same edge in two images between which features are to be matched. If the point from which all the arrows originate is taken as a key point, there is no unique point in the second image to which it can be matched, because all the points along the edge look the same. The arrows show that the selected point could be matched with many points on the edge, which is not desirable. We therefore search for corner points, which are locally unique.
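The ambiguity can be reproduced numerically. The sketch below is only an illustration on hypothetical synthetic images (the helper names `make_image` and `patch_ssd`, the 16x16 image size and the shift values are all assumptions). It compares a small patch with shifted copies of itself using the sum of squared differences (SSD): on an edge, sliding along the edge costs nothing, whereas at a corner every shift changes the patch.

```python
def make_image(kind, size=16):
    """Build a synthetic binary image as a list of rows (illustrative only)."""
    half = size // 2
    if kind == "edge":
        # Vertical edge: dark on the left, bright on the right.
        return [[1.0 if x >= half else 0.0 for x in range(size)]
                for _ in range(size)]
    # Corner: bright only in the bottom-right quadrant.
    return [[1.0 if (x >= half and y >= half) else 0.0 for x in range(size)]
            for y in range(size)]


def patch_ssd(img, cx, cy, dx, dy, r=2):
    """SSD between the patch centered at (cx, cy) and the same patch shifted by (dx, dy)."""
    total = 0.0
    for y in range(cy - r, cy + r + 1):
        for x in range(cx - r, cx + r + 1):
            diff = img[y + dy][x + dx] - img[y][x]
            total += diff * diff
    return total


edge = make_image("edge")
corner = make_image("corner")
center = 8
shifts = [(0, 3), (3, 0), (3, 3)]  # (dx, dy): along the edge, across it, diagonal

edge_ssd = [patch_ssd(edge, center, center, dx, dy) for dx, dy in shifts]
corner_ssd = [patch_ssd(corner, center, center, dx, dy) for dx, dy in shifts]

print("edge  :", edge_ssd)    # the first entry (shift along the edge) is zero
print("corner:", corner_ssd)  # every shift gives a non-zero difference
```

A zero SSD for a non-zero shift means the patch cannot be told apart from its shifted copy, which is exactly why an edge point is a poor key point.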

## Interest Point Detection

There is a mathematical formulation for finding corners in an image. We won’t go into the mathematical details, but I will describe the logic behind it. For every pixel in the image, we define an auto-correlation matrix A as follows:

$A = w \ast \begin{bmatrix} I_{x}^{2} & I_{x}I_{y}\\ I_{x}I_{y} & I_{y}^{2} \end{bmatrix}$

Here $I_{x}$ is the horizontal derivative and $I_{y}$ is the vertical derivative at that pixel, $w$ is the weighting function (typically a Gaussian), and $\ast$ denotes convolution, so each entry of A is a weighted sum of gradient products over a window around the pixel. Note that w is not a constant: its value depends on where a neighboring pixel lies within the window, with pixels closer to the center weighted more heavily. Let $\lambda_{0}$ and $\lambda_{1}$ be the eigenvalues of this matrix. It can be shown that a pixel is a corner point if both $\lambda_{0}$ and $\lambda_{1}$ are ‘big’ in value. In a certain sense, a pixel is a key point if its gradient is big in both directions. Several quantities have been defined which, when found to be above a certain threshold, indicate that both $\lambda_{0}$ and $\lambda_{1}$ are big. Some of them are listed below:
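As a quick sanity check of the eigenvalue criterion, consider two idealized windows. This is an illustrative assumption rather than a real image, and it takes the weights w to sum to one. If the window contains only a vertical edge of gradient strength $g$, then $I_{x} = g$ and $I_{y} = 0$ everywhere in the window, so

$A = \begin{bmatrix} g^{2} & 0\\ 0 & 0 \end{bmatrix}, \quad \lambda_{0} = 0, \quad \lambda_{1} = g^{2}$

Only one eigenvalue is big, so the pixel is rejected. If instead half of the weight falls on a vertical edge ($I_{x} = g$, $I_{y} = 0$) and the other half on a horizontal edge ($I_{x} = 0$, $I_{y} = g$), as happens around an ideal corner, then

$A = \begin{bmatrix} g^{2}/2 & 0\\ 0 & g^{2}/2 \end{bmatrix}, \quad \lambda_{0} = \lambda_{1} = g^{2}/2$

and both eigenvalues are big, so the pixel qualifies as a corner.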

Szeliski Detector
$det(A) / trace(A) = \lambda_{0} \lambda_{1} / (\lambda_{0} + \lambda_{1})$

Harris Detector
$det(A) - \alpha trace(A)^{2} = \lambda_{0} \lambda_{1} - \alpha ( \lambda_{0} + \lambda_{1} )^{2}$

$\alpha$ is typically taken to be 0.06

Tomasi Detector
$min(\lambda_{0}, \lambda_{1})$

Triggs Detector
$\lambda_{0} - \alpha \lambda_{1}$

$\alpha$ is typically taken to be 0.06
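To make the four quantities concrete, here is a minimal Python sketch that evaluates them for a single symmetric matrix A = [[a, b], [b, c]]. The function names, the convention that $\lambda_{0}$ is the smaller eigenvalue, and the small `eps` that guards the division are assumptions made for illustration.

```python
import math


def eigenvalues_2x2_sym(a, b, c):
    """Return (smaller, larger) eigenvalue of the symmetric matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2.0
    radius = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    return mean - radius, mean + radius


def corner_measures(a, b, c, alpha=0.06, eps=1e-12):
    """Evaluate the four corner measures for A = [[a, b], [b, c]].

    Assumes lam0 <= lam1; alpha = 0.06 follows the value quoted in the text.
    """
    lam0, lam1 = eigenvalues_2x2_sym(a, b, c)
    det = a * c - b * b   # equals lam0 * lam1
    trace = a + c         # equals lam0 + lam1
    return {
        "szeliski": det / (trace + eps),     # eps avoids 0/0 in flat regions
        "harris": det - alpha * trace ** 2,  # no eigenvalues needed
        "tomasi": lam0,                      # min(lam0, lam1)
        "triggs": lam0 - alpha * lam1,
    }


corner = corner_measures(1.0, 0.0, 1.0)  # gradients in both directions
edge = corner_measures(1.0, 0.0, 0.0)    # gradient in one direction only

for name in corner:
    print(f"{name:8s} corner={corner[name]: .3f} edge={edge[name]: .3f}")
```

For the ideal corner all four values are clearly positive, while for the ideal edge they are zero or slightly negative, which is why thresholding any one of them separates corners from edges.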

The algorithm is thus as follows:

1. Convert the image to grayscale and blur it using a Gaussian.
2. Compute the horizontal and vertical derivatives of the image, $I_{x}$ and $I_{y}$.
3. Compute the three images corresponding to the outer products of these gradients ($I_{x}^{2}$, $I_{x}I_{y}$ and $I_{y}^{2}$).
4. Convolve each of these images with a larger Gaussian.
5. For each pixel in the original image:
    1. Construct the auto-correlation matrix A using the three images ($I_{x}^{2}$, $I_{x}I_{y}$ and $I_{y}^{2}$).
    2. Compute one of the quantities mentioned above (Szeliski, Harris, Tomasi, Triggs).
    3. If the value is a local maximum and exceeds a chosen threshold, report the pixel as a key point.
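The steps above can be sketched in a few lines of NumPy. This is a minimal illustration and not the OpenCV code referred to in this post. The function names, the two Gaussian widths (`sigma_d`, `sigma_i`), the 3x3 neighborhood used for the local-maximum test and the relative threshold are all assumptions chosen for the example, and the Harris quantity is used as the corner measure.

```python
import numpy as np


def gaussian_kernel(sigma):
    """1-D Gaussian kernel truncated at about 3 sigma and normalized to sum to 1."""
    radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()


def blur(img, sigma):
    """Separable Gaussian blur with edge replication at the borders."""
    k = gaussian_kernel(sigma)
    padded = np.pad(img, len(k) // 2, mode="edge")

    def conv(v):
        return np.convolve(v, k, mode="valid")

    rows = np.apply_along_axis(conv, 1, padded)  # filter along x
    return np.apply_along_axis(conv, 0, rows)    # filter along y


def harris_keypoints(gray, sigma_d=1.0, sigma_i=2.0, alpha=0.06, rel_threshold=0.01):
    """Return ([(x, y), ...], response) for a 2-D grayscale array."""
    img = blur(gray.astype(float), sigma_d)      # step 1: smooth the image
    iy, ix = np.gradient(img)                    # step 2: derivatives (rows = y, cols = x)
    ixx = blur(ix * ix, sigma_i)                 # steps 3-4: gradient products,
    ixy = blur(ix * iy, sigma_i)                 # smoothed with a larger Gaussian
    iyy = blur(iy * iy, sigma_i)

    det = ixx * iyy - ixy ** 2                   # step 5: entries of A per pixel
    trace = ixx + iyy
    response = det - alpha * trace ** 2          # Harris measure

    # Keep pixels that are the maximum of their 3x3 neighborhood ...
    h, w = response.shape
    padded = np.pad(response, 1, mode="constant", constant_values=-np.inf)
    neighbors = np.stack([padded[dy:dy + h, dx:dx + w]
                          for dy in range(3) for dx in range(3)])
    local_max = response >= neighbors.max(axis=0)
    # ... and that exceed a threshold relative to the strongest response.
    strong = response > rel_threshold * response.max()

    ys, xs = np.nonzero(local_max & strong)
    return list(zip(xs.tolist(), ys.tolist())), response


# Hypothetical test image: a bright square on a dark background.
image = np.zeros((32, 32))
image[8:24, 8:24] = 1.0
keypoints, response = harris_keypoints(image)
print(keypoints)  # expected to lie near the four corners of the square
```

Swapping in one of the other measures only changes the line that computes `response`. The threshold still has to be tuned for each image and each measure.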

The OpenCV code of the above algorithm can be found here.

There are no universal thresholds for these quantities, so they have to be tuned by trial and error. Of the above, the Harris detector is the one most widely used because it is computationally cheap (it needs only the determinant and trace of A, so the eigenvalues never have to be calculated explicitly) and also gives good features. The following image shows the result of applying the above quantities to Lenna:
