Elements of Computer Vision: Multiple View Geometry.

§ 7. Getting practical

In this section we will approach estimation problems from a more “practical” point of view.

First, we will discuss how the presence of errors in the data affects our estimates and describe the countermeasures that must be taken to obtain a good estimate.

Second, we will introduce into the pinhole model the non-linear distortions due to lenses, and we will illustrate a practical calibration algorithm that works with a simple planar object.

Finally, we will describe rectification, a transformation of image pairs such that conjugate epipolar lines become collinear and parallel to one of the image axes, usually the horizontal one. In such a way, the correspondence search is reduced to a 1D search along the trivially identified scanline.

§ 7.1. Pre-conditioning

In the presence of noise (or errors) in the input data, the accuracy of the solution of a linear system depends crucially on the condition number of the system. The lower the condition number, the less the input error is amplified (the more stable the system is).

As Hartley (1995) pointed out, it is crucial for linear algorithms (such as the DLT algorithm) that the input data is properly pre-conditioned by a suitable change of coordinates (origin and scale): points are translated so that their centroid is at the origin and are scaled so that their average distance from the origin is √2.

This improves the condition number of the linear system that is being solved.

Apart from improved accuracy, this procedure also provides invariance under similarity transformations in the image plane.
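The normalization step can be coded in a few lines (a minimal numpy sketch; the function name and interface are ours):

```python
import numpy as np

def precondition(points):
    """Hartley pre-conditioning: translate the points so that their
    centroid is at the origin and scale them so that their average
    distance from the origin is sqrt(2).
    points: (n, 2) array. Returns (normalized points, 3x3 similarity T)."""
    centroid = points.mean(axis=0)
    centered = points - centroid
    s = np.sqrt(2) / np.mean(np.linalg.norm(centered, axis=1))
    # T maps homogeneous points to their normalized counterparts
    T = np.array([[s, 0, -s * centroid[0]],
                  [0, s, -s * centroid[1]],
                  [0, 0, 1.0]])
    return s * centered, T
```

The similarity T is applied to the data before running the linear algorithm, and its effect is undone on the estimated entity afterwards (e.g. H ← T_r⁻¹ Ĥ T for a homography estimated from normalized points).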

§ 7.2. Algebraic vs geometric error

Measured data (i.e., image or world point positions) is noisy.

Usually, to counteract the effect of noise, we use more equations than necessary and solve with least-squares.

What is actually being minimized by least squares?

In a typical null-space problem formulation Ax=0 (like the DLT algorithm) the quantity that is being minimized is the square of the residual Ax.

In general, if Ax can be regarded as a distance between the geometrical entities involved (points, lines, planes, etc.), then what is being minimized is a geometric error; otherwise (when the error lacks a good geometrical interpretation) it is called an algebraic error.

All the linear algorithms (DLT and others) we have seen so far minimize an algebraic error. Actually, there is no justification for minimizing an algebraic error apart from the ease of implementation, as it results in a linear problem.

Usually, the minimization of a geometric error is a non-linear problem, which admits only iterative solutions and requires a starting point.

So, why should we prefer to minimize a geometric error? Because:

  • The quantity being minimized has a meaning

  • The solution is more stable

  • The solution is invariant under Euclidean transforms

Often a linear solution based on algebraic residuals is used as a starting point for a non-linear minimization of a geometric cost function, which “gives the solution a final polish” (Hartley and Zisserman, 2003).

§ 7.2.1. Geometric error for resection

The goal is to estimate the camera matrix P, given a number of correspondences {(m_j, M_j)}_{j=1...n}.

The geometric error associated to a camera estimate P is the distance between the measured image point m_j and the re-projected point P M_j:

\min_P \sum_j d(P M_j, m_j)^2    (116)

where d() is the Euclidean distance between the homogeneous points.

The DLT solution is used as a starting point for the iterative minimization (e.g. Gauss-Newton).
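The cost (116) is straightforward to code; an iterative optimizer is then started from the DLT estimate. A minimal numpy sketch (names ours):

```python
import numpy as np

def reprojection_error(P, M, m):
    """Cost of Eq. (116): sum of squared distances between the measured
    image points m (n, 2) and the projections of the known world points
    M (n, 3) under the camera matrix P (3, 4)."""
    Mh = np.hstack([M, np.ones((len(M), 1))])   # homogeneous 3-D points
    p = Mh @ P.T                                # projected (homogeneous)
    p = p[:, :2] / p[:, 2:3]                    # back to Cartesian
    return np.sum((p - m) ** 2)
```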

§ 7.2.2. Geometric error for triangulation

The goal is to estimate the 3-D coordinates of a point M, given its projections m_i and the camera matrices P_i for every view i = 1 ... m.

The geometric error associated to a point estimate M in the i-th view is the distance between the measured image point m_i and the re-projected point P_i M:

\min_M \sum_i d(P_i M, m_i)^2    (117)

where d() is the Euclidean distance between the homogeneous points.

The linear solution is used as a starting point for the iterative minimization (e.g. Gauss-Newton).
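The linear starting point is the classical DLT triangulation, sketched below (names ours): each view contributes two equations obtained by eliminating the projective scale, and the point is the null-vector of the stacked system.

```python
import numpy as np

def triangulate(Ps, ms):
    """Linear (DLT) triangulation of one 3-D point from its projections
    ms[i] = (u, v) in the cameras Ps[i] (3x4 matrices). Used as the
    starting point for the iterative minimization of Eq. (117)."""
    A = []
    for P, (u, v) in zip(Ps, ms):
        A.append(u * P[2] - P[0])   # u * (row 3) - (row 1)
        A.append(v * P[2] - P[1])   # v * (row 3) - (row 2)
    _, _, Vt = np.linalg.svd(np.asarray(A))
    M = Vt[-1]                      # right null-vector
    return M[:3] / M[3]
```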

§ 7.2.3. Geometric error for F

The goal is to estimate F, given a number of point correspondences m_j ↔ m_{rj}.

The geometric error associated to an estimate F is given by the distance of conjugate points from conjugate lines (note the symmetry):

\min_F \sum_j d(F m_j, m_{rj})^2 + d(F^T m_{rj}, m_j)^2    (118)

where d() here is the Euclidean distance between a line and a point (in homogeneous coordinates).

The eight-point solution is used as a starting point for the iterative minimization (e.g. Gauss-Newton).

Note that F must be suitably parametrized, as it has only seven d.o.f.
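The cost (118) combines point-to-line distances; a minimal numpy sketch (names ours), for homogeneous points:

```python
import numpy as np

def point_line_dist(line, point):
    """Euclidean distance between a 2-D line and a point, both given
    in homogeneous coordinates."""
    return abs(line @ point) / (np.hypot(line[0], line[1]) * point[2])

def symmetric_epipolar_error(F, m, mr):
    """Cost of Eq. (118): distance of each point from the epipolar
    line of its conjugate, in both images."""
    return sum(point_line_dist(F @ a, b) ** 2 +
               point_line_dist(F.T @ b, a) ** 2
               for a, b in zip(m, mr))
```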

§ 7.2.4. Geometric error for H

The goal is to estimate H, given a number of point correspondences m_j ↔ m_{rj}.

The geometric error associated to an estimate H is given by the symmetric distance between a point and its transformed conjugate:

\min_H \sum_j d(H m_j, m_{rj})^2 + d(H^{-1} m_{rj}, m_j)^2    (119)

where d() is the Euclidean distance between the homogeneous points. This is also called the symmetric transfer error.

The linear solution is used as a starting point for the iterative minimization (e.g. Gauss-Newton).
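The symmetric transfer error (119) is easy to code directly (a numpy sketch, names ours):

```python
import numpy as np

def transfer(H, p):
    """Apply the homography H to a Cartesian 2-D point p."""
    q = H @ np.array([p[0], p[1], 1.0])
    return q[:2] / q[2]

def symmetric_transfer_error(H, m, mr):
    """Cost of Eq. (119): forward transfer error plus backward
    transfer error, for Cartesian correspondences m[i] <-> mr[i]."""
    Hinv = np.linalg.inv(H)
    return sum(np.sum((transfer(H, a) - b) ** 2) +
               np.sum((transfer(Hinv, b) - a) ** 2)
               for a, b in zip(m, mr))
```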

§ 7.2.5. Bundle adjustment (reconstruction)

If measurements are noisy, the projection equation will not be satisfied exactly by the camera matrices and structure computed in Sec. 5.3.2.

We wish to minimize the image distance between the re-projected points P^i M_j and the measured image points m_j^i, for every view in which the 3-D point appears:

\min_{P^i, M_j} \sum_{i,j} d(P^i M_j, m_j^i)^2    (120)

where d() is the Euclidean distance between the homogeneous points.

As m and n increase, this becomes a very large minimization problem.

A solution is to alternate between minimizing the re-projection error by varying the P^i and minimizing it by varying the M_j.

See (Triggs et al., 2000) for a review and a more detailed discussion of bundle adjustment.

§ 7.3. Robust estimation

Up to this point, we have assumed that the only source of error affecting correspondences is in the measurement of point positions. This is a small-scale noise that gets averaged out with least-squares.

In practice, we can be presented with mismatched points, which are outliers to the noise distribution (i.e., rogue measurements following a different, unmodelled, distribution).

These outliers can severely disturb least-squares estimation (even a single outlier can totally offset the least-squares estimation, as illustrated in Fig. 16.)

Figure 16. A single outlier can severely offset the least-squares estimate (red line), whereas the robust estimate (green line) is unaffected.

The goal of robust estimation is to be insensitive to outliers (or at least to reduce sensitivity).

§ 7.3.1. M-estimators

Least squares:

\min_\theta \sum_i (r_i / \sigma_i)^2    (121)

where θ are the regression coefficients (what is being estimated) and r_i is the residual. M-estimators are based on the idea of replacing the squared residuals with another function of the residual, yielding

\min_\theta \sum_i \rho(r_i / \sigma_i)    (122)

where ρ is a symmetric function with a unique minimum at zero that grows sub-quadratically, called the loss function.

Differentiating with respect to θ yields:

\sum_i \frac{1}{\sigma_i}\, \rho'(r_i / \sigma_i)\, \frac{d r_i}{d \theta} = 0    (123)

The M-estimate is obtained by solving this system of non-linear equations.
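In practice this system is usually solved by iteratively reweighted least squares (IRLS). Below is a sketch for a robust line fit; the Huber loss, its tuning constant and the MAD scale estimate are standard choices of ours, not prescribed by the text:

```python
import numpy as np

def huber_weight(r, k=1.345):
    """IRLS weight rho'(r)/r for the Huber loss (k: tuning constant)."""
    a = np.abs(r)
    return np.where(a <= k, 1.0, k / a)

def irls_line(x, y, iters=20):
    """Robust fit of y = a*x + b: repeatedly solve a weighted
    least-squares problem, with weights from the previous residuals."""
    A = np.column_stack([x, np.ones_like(x)])
    theta = np.linalg.lstsq(A, y, rcond=None)[0]       # LS start
    for _ in range(iters):
        r = y - A @ theta
        s = 1.4826 * np.median(np.abs(r)) + 1e-12      # robust scale (MAD)
        w = huber_weight(r / s)
        Aw = A * w[:, None]
        theta = np.linalg.solve(Aw.T @ A, Aw.T @ y)    # weighted normal eqs.
    return theta
```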

§ 7.3.2. RANSAC

Given a model that requires a minimum of p data points to instantiate its free parameters θ, and a set of data points S containing outliers:

  1. Randomly select a subset of p points of S and instantiate the model from this subset

  2. Determine the set Si of data points that are within an error tolerance t of the model. Si is the consensus set of the sample.

  3. If the size of Si is greater than a threshold T, re-estimate the model (possibly using least-squares) using Si (the set of inliers) and terminate.

  4. If the size of Si is less than T, repeat from step 1.

  5. Terminate after N trials and choose the largest consensus set found so far.
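The scheme above, in its simplest variant that runs a fixed number of trials (step 5, without the early-termination threshold T), can be sketched for 2-D line fitting (p = 2; names ours):

```python
import numpy as np

def ransac_line(points, t=0.1, N=100, rng=None):
    """Fit y = a*x + b to points (n, 2) with RANSAC: sample minimal
    subsets, keep the largest consensus set, re-fit by least squares."""
    rng = np.random.default_rng(rng)
    best = None
    for _ in range(N):
        i, j = rng.choice(len(points), size=2, replace=False)
        (x1, y1), (x2, y2) = points[i], points[j]
        if np.isclose(x1, x2):
            continue                       # degenerate sample
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        r = np.abs(points[:, 1] - (a * points[:, 0] + b))
        inliers = r < t                    # consensus set
        if best is None or inliers.sum() > best.sum():
            best = inliers
    x, y = points[best, 0], points[best, 1]
    A = np.column_stack([x, np.ones_like(x)])
    return np.linalg.lstsq(A, y, rcond=None)[0]   # final LS re-fit on inliers
```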

Three parameters need to be specified: t, T and N.

Both T and N are linked to the (unknown) fraction of outliers ϵ.

N should be large enough to ensure a high probability of selecting at least one sample containing all inliers. The probability of selecting, in N trials, at least one sample of p inliers is:

P = 1 - \left(1 - (1 - \epsilon)^p\right)^N    (124)

By requiring that P must be near 1, N can be solved for given values of p and ϵ.
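Solving Eq. (124) for N gives the formula sketched below (P = 0.99 is a common choice, not prescribed by the text):

```python
import numpy as np

def num_trials(p, eps, P=0.99):
    """Number of RANSAC trials N such that, with probability P, at
    least one sample of size p contains no outliers, given an outlier
    fraction eps (Eq. (124) solved for N)."""
    return int(np.ceil(np.log(1 - P) / np.log(1 - (1 - eps) ** p)))
```

For example, with p = 8 (the eight-point algorithm) and eps = 0.5, about 1177 trials are required.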

T should be equal to the expected number of inliers, which is given (in fraction) by 1-ϵ.

At each iteration, the largest consensus set found so far gives a lower bound on the fraction of inliers, or, equivalently, an upper bound on the number of outliers. This can be used to adaptively adjust the number of trials N.

t is determined empirically, but in some cases it can be related to the probability that a point under the threshold is actually an inlier (Hartley and Zisserman, 2003).

As pointed out by Stewart (1999), RANSAC can be viewed as a particular M-estimator.

The objective function that RANSAC maximizes is the number of data points having absolute residuals smaller than a predefined value t. This may be seen as minimising a binary loss function that is zero for small (absolute) residuals and one for large absolute residuals, with a discontinuity at t.

Figure 17. RANSAC loss function

By virtue of the prespecified inlier band, RANSAC can fit a model to data corrupted by substantially more than 50% of outliers.

§ 7.3.3. LMedS

Another popular robust estimator is the Least Median of Squares. It is defined by:

\min_\theta\ \operatorname{med}_i\ r_i^2    (125)

It can tolerate up to 50% of outliers, as up to half of the data points can be arbitrarily far from the “true” estimate without changing the value of the objective function.

Since the median is not differentiable, a random sampling strategy similar to RANSAC is adopted. Instead of using the consensus, each sample of size p is scored by the median of the residuals of all the data points. The model with the least median (lowest score) is chosen.

A final weighted least-squares fitting is used.

With respect to RANSAC, LMedS can tolerate “only” 50% of outliers, but requires no setting of thresholds.
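A sketch of the sampling scheme for a 2-D line follows (the final weighted least-squares polish is omitted; names ours):

```python
import numpy as np

def lmeds_line(points, N=100, rng=None):
    """LMedS fit of y = a*x + b: sample minimal subsets as in RANSAC,
    but score each model by the median of all squared residuals and
    keep the model with the lowest score (no threshold needed)."""
    rng = np.random.default_rng(rng)
    best, best_score = None, np.inf
    for _ in range(N):
        i, j = rng.choice(len(points), size=2, replace=False)
        (x1, y1), (x2, y2) = points[i], points[j]
        if np.isclose(x1, x2):
            continue                       # degenerate sample
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        score = np.median((points[:, 1] - (a * points[:, 0] + b)) ** 2)
        if score < best_score:
            best, best_score = (a, b), score
    return np.array(best)
```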

§ 7.4. Practical calibration

Camera calibration (or resection), as described so far, requires a calibration object that typically consists of two or three planes orthogonal to each other. This might be difficult to obtain without access to a machine tool.

Zhang (1998) introduced a calibration technique that requires the camera to observe a planar pattern (much easier to obtain) at a few (at least three) different orientations. Either the camera or the planar pattern can be moved by hand.

Instead of requiring one image of many planes, this method requires many images of one plane.

We will also introduce here a more realistic camera model that takes into account non-linear effects produced by lenses.

In each view, we assume that correspondences between image points and 3-D points on the planar pattern have been established.

Figure 18. Images of a planar calibration pattern. The points used for calibration are the corners of the black squares.

§ 7.4.1. Estimating intrinsic parameters

Following the development of Sec. 4.6 we know that for a camera P=K[R|t] the homography between a world plane at z=0 and the image is

H \simeq K\, [r_1, r_2, t]    (126)

where r_i are the columns of R.

Suppose that H is computed from correspondences between four or more known world points and their images; then some constraints can be obtained on the intrinsic parameters, thanks to the fact that the columns of R are orthonormal.

Writing H = [h_1, h_2, h_3], from the previous equation we derive:

r_1 = \lambda K^{-1} h_1 \quad \text{and} \quad r_2 = \lambda K^{-1} h_2    (127)

where λ is an unknown scale factor.

The orthogonality r_1^T r_2 = 0 gives

\lambda^2\, h_1^T (K K^T)^{-1} h_2 = 0    (128)

or, equivalently (remember that \omega = (K K^T)^{-1})

h_1^T\, \omega\, h_2 = 0    (129)

Likewise, the condition on the norm, r_1^T r_1 = r_2^T r_2, gives

h_1^T\, \omega\, h_1 = h_2^T\, \omega\, h_2    (130)

Introducing the Kronecker product as usual, we rewrite these two equations as:

(h_1^T \otimes h_2^T)\, \operatorname{vec} \omega = 0    (131)
(h_1^T \otimes h_1^T - h_2^T \otimes h_2^T)\, \operatorname{vec} \omega = 0    (132)

As ω is a 3×3 symmetric matrix, its unique elements (the unknowns) are only six. This fact can be neatly taken into account using the vech operator and the duplication matrix D_3 (the duplication matrix D_n is the unique n^2 × n(n+1)/2 matrix which transforms vech A into vec A: D_n vech A = vec A). The above equations are equivalent to:

(h_2^T \otimes h_1^T)\, D_3\, \operatorname{vech} \omega = 0    (133)
(h_1^T \otimes h_1^T - h_2^T \otimes h_2^T)\, D_3\, \operatorname{vech} \omega = 0    (134)

From a set of n images, we obtain a 2n × 6 coefficient matrix A by stacking the two equations for each image. The solution is the one-dimensional right null-space of A.

At least five equations are needed, i.e., 2½ images, hence at least three. In practice, for a good calibration, one should use around 12 views.
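Since ω is symmetric, the products h_i^T ω h_j are linear in vech ω, and the two rows contributed by each homography can be written out explicitly without forming the duplication matrix (a numpy sketch, names ours):

```python
import numpy as np

def constraint_rows(H):
    """The two rows of Eqs. (133)-(134) for one homography
    H = [h1 h2 h3], as linear constraints on
    vech(omega) = (w11, w12, w13, w22, w23, w33)."""
    def v(i, j):
        hi, hj = H[:, i], H[:, j]
        # coefficients of h_i^T omega h_j, with omega symmetric
        return np.array([hi[0] * hj[0],
                         hi[0] * hj[1] + hi[1] * hj[0],
                         hi[0] * hj[2] + hi[2] * hj[0],
                         hi[1] * hj[1],
                         hi[1] * hj[2] + hi[2] * hj[1],
                         hi[2] * hj[2]])
    return np.vstack([v(0, 1), v(0, 0) - v(1, 1)])
```

Stacking these rows for all the views gives the matrix A whose right null-space is vech ω.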

§ 7.4.2. Estimating extrinsic parameters

K is obtained from the Cholesky factorization of ω, then R and t are recovered from:

[r_1, r_2, t] = \frac{1}{\| K^{-1} h_1 \|}\, K^{-1} [h_1, h_2, h_3], \qquad r_3 = r_1 \times r_2    (135)

Because of noise, the matrix R is not guaranteed to be orthogonal, hence we need to recover the closest orthogonal matrix.

Let R=QS be the polar decomposition of R. Then Q is the closest possible orthogonal matrix to R in Frobenius norm.
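The orthogonal polar factor is conveniently computed through the SVD: if R = U S V^T, then Q = U V^T. A sketch follows (the determinant correction, which enforces a proper rotation, is a common addition of ours, not mentioned in the text):

```python
import numpy as np

def closest_rotation(R):
    """Closest orthogonal matrix to R in Frobenius norm, via the SVD
    (equal to the orthogonal factor Q of the polar decomposition)."""
    U, _, Vt = np.linalg.svd(R)
    # flip the last axis if needed, so that det(Q) = +1
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])
    return U @ D @ Vt
```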

In this way we have obtained the camera matrix P by minimizing an algebraic distance which is not geometrically meaningful.

It is advisable to refine it with a (non-linear) minimization of a geometric error:

\min_{P^i} \sum_{i=1}^{n} \sum_{j=1}^{m} d(P^i M_j, m_j^i)^2    (136)

where P^i = K [R_i | t_i] and the rotation has to be suitably parametrized with three parameters (see the Rodrigues formula).

The linear solution is used as a starting point for the iterative minimization (e.g. Gauss-Newton).

§ 7.4.3. Radial distortion

A realistic model for a photocamera or a videocamera must take into account non-linear distortions introduced by the lenses, especially when dealing with short focal lengths or low cost devices (e.g. webcams, disposable cameras).

Figure 19. Radial distortion

The most relevant effect is the radial distortion, which is modeled as a non-linear transformation from the ideal (undistorted) pixel coordinates (u, v) to the observed (distorted) pixel coordinates (û, v̂):

\begin{cases} \hat{u} = (u - u_0)(1 + k_1 r_d^2) + u_0 \\ \hat{v} = (v - v_0)(1 + k_1 r_d^2) + v_0 \end{cases}    (137)

where r_d^2 = \left( \frac{u - u_0}{\alpha_u} \right)^2 + \left( \frac{v - v_0}{\alpha_v} \right)^2 and (u_0, v_0) are the coordinates of the image centre.

This distortion makes a rectangle appear as a “barrel” or a “cushion” (as in Fig. 19), depending on the sign of the coefficient k_1.

§ 7.4.4. Estimating k1

Let us assume that the pinhole model is calibrated. The points m = (u, v) projected according to the pinhole model (undistorted) do not coincide with the measured points m̂ = (û, v̂), because of the radial distortion.

We wish to recover k_1 from Eq. (137). Each point gives two equations:

\begin{cases} (u - u_0)\, r_d^2\, k_1 = \hat{u} - u \\ (v - v_0)\, r_d^2\, k_1 = \hat{v} - v \end{cases}    (138)

hence a least squares solution for k1 is readily obtained from n>1 points.
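Since Eq. (138) is linear in the single unknown k_1, the least-squares solution reduces to one division (a numpy sketch; names ours, with α_u, α_v the focal lengths in pixels):

```python
import numpy as np

def estimate_k1(m_ideal, m_obs, u0, v0, au, av):
    """Least-squares solution of Eq. (138) for the radial distortion
    coefficient k1, given the undistorted projections m_ideal (n, 2)
    and the measured (distorted) points m_obs (n, 2)."""
    du, dv = m_ideal[:, 0] - u0, m_ideal[:, 1] - v0
    rd2 = (du / au) ** 2 + (dv / av) ** 2
    A = np.concatenate([du * rd2, dv * rd2])        # stacked coefficients
    b = np.concatenate([m_obs[:, 0] - m_ideal[:, 0],
                        m_obs[:, 1] - m_ideal[:, 1]])
    return float(A @ b / (A @ A))                   # scalar least squares
```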

When calibrating a camera we are required to simultaneously estimate both the pinhole model's parameters and the radial distortion coefficient.

The pinhole calibration we have described so far assumed no radial distortion, and the radial distortion calibration assumed a calibrated pinhole camera.

The solution (a very common one in similar cases) is to alternate between the two estimates until convergence.

Namely: start by assuming k_1 = 0 and calibrate the pinhole model; then use that model to compute the radial distortion. Once k_1 has been estimated, refine the pinhole model by solving Eq. (136) with the radial distortion included in the projection, and iterate until the image error is small enough.

§ 7.5. Practical multi-view reconstruction

The multi-view projective reconstruction method based on factorization, albeit elegant, is of little practical use, because it assumes that all the points are visible in all the views. The method for Euclidean reconstruction, based on the factorization of essential matrices, is more effective, but it does not scale well to large scenarios. The best results in the state of the art have been demonstrated by pipelines based on the idea of growing partial reconstructions, composed of cameras and points, where new cameras are added by resection and new points by intersection. Frequent bundle adjustments avoid drift. Some examples of such pipelines are (Brown and Lowe, 2005; Kamberov et al., 2006; Snavely et al., 2006; Vergauwen and Gool, 2006; Irschara et al., 2007) and (Gherardi et al., 2010). An example of the latter is shown in Fig. 20.

Figure 20. An example of reconstruction obtained with Samantha (Gherardi et al.,2010).