
Coordinated Linear Algebra

Section 10.5 \(QR\) Factorization and Least Squares Approximations

One of the main virtues of orthogonal matrices is that they can be easily inverted---the transpose is the inverse. This fact, combined with the factorization theorem in this section, provides a useful way to simplify many matrix calculations.

Definition 10.5.1.

Let \(A\) be an \(m \times n\) matrix with independent columns. A QR-factorization of \(A\) expresses it as \(A = QR\) where \(Q\) is \(m \times n\) with orthonormal columns and \(R\) is an invertible and upper triangular matrix with positive diagonal entries.
The importance of the factorization lies in the fact that there are computer algorithms that accomplish it with good control over round-off error, making it particularly useful in matrix calculations.
The factorization is a matrix version of the Gram-Schmidt process. Suppose
\begin{equation*} A = \left[ \begin{array}{cccc} |\amp |\amp \amp | \\ \mathbf{c}_{1} \amp \mathbf{c}_{2} \amp \cdots \amp \mathbf{c}_{n}\\ |\amp |\amp \amp | \end{array}\right] \end{equation*}
is an \(m \times n\) matrix with linearly independent columns \(\mathbf{c}_{1}, \mathbf{c}_{2}, \dots, \mathbf{c}_{n}\text{.}\) The Gram-Schmidt algorithm can be applied to these columns to provide orthogonal columns \(\mathbf{f}_{1}, \mathbf{f}_{2}, \dots, \mathbf{f}_{n}\) where \(\mathbf{f}_{1} = \mathbf{c}_{1}\) and
\begin{equation*} \mathbf{f}_{k} = \mathbf{c}_{k} - \frac{\mathbf{c}_{k} \cdot \mathbf{f}_{1}}{\norm{ \mathbf{f}_{1} }^2}\mathbf{f}_{1} - \frac{\mathbf{c}_{k} \cdot \mathbf{f}_{2}}{\norm{ \mathbf{f}_{2} }^2}\mathbf{f}_{2} - \dots - \frac{\mathbf{c}_{k} \cdot \mathbf{f}_{k-1}}{\norm{ \mathbf{f}_{k-1} }^2}\mathbf{f}_{k-1} \end{equation*}
for each \(k = 2, 3, \dots, n\text{.}\) Now write \(\mathbf{q}_{k} = \frac{1}{\norm{ \mathbf{f}_{k} }}\mathbf{f}_{k}\) for each \(k\text{.}\) Then \(\mathbf{q}_{1}, \mathbf{q}_{2}, \dots, \mathbf{q}_{n}\) are orthonormal columns, and the above equation becomes
\begin{equation*} \norm{ \mathbf{f}_{k} } \mathbf{q}_{k} = \mathbf{c}_{k} - (\mathbf{c}_{k} \cdot \mathbf{q}_{1})\mathbf{q}_{1} - (\mathbf{c}_{k} \cdot \mathbf{q}_{2})\mathbf{q}_{2} - \dots - (\mathbf{c}_{k} \cdot \mathbf{q}_{k-1})\mathbf{q}_{k-1}. \end{equation*}
Using these equations, express each \(\mathbf{c}_{k}\) as a linear combination of the \(\mathbf{q}_{i}\text{:}\)
\begin{equation*} \begin{array}{ccl} \mathbf{c}_{1} \amp =\amp \norm{ \mathbf{f}_{1} } \mathbf{q}_{1} \\ \mathbf{c}_{2} \amp =\amp (\mathbf{c}_{2} \cdot \mathbf{q}_{1})\mathbf{q}_{1} + \norm{ \mathbf{f}_{2} } \mathbf{q}_{2} \\ \mathbf{c}_{3} \amp =\amp (\mathbf{c}_{3} \cdot \mathbf{q}_{1})\mathbf{q}_{1} + (\mathbf{c}_{3} \cdot \mathbf{q}_{2})\mathbf{q}_{2} + \norm{ \mathbf{f}_{3} } \mathbf{q}_{3} \\ \vdots \amp \amp \vdots \\ \mathbf{c}_{n} \amp =\amp (\mathbf{c}_{n} \cdot \mathbf{q}_{1})\mathbf{q}_{1} + (\mathbf{c}_{n} \cdot \mathbf{q}_{2})\mathbf{q}_{2} + (\mathbf{c}_{n} \cdot \mathbf{q}_{3})\mathbf{q}_{3} + \dots + \norm{ \mathbf{f}_{n} } \mathbf{q}_{n} \end{array} \end{equation*}
These equations have a matrix form that gives the required factorization:
\begin{align} A \amp = \left[ \begin{array}{ccccc} |\amp |\amp |\amp \amp | \\ \mathbf{c}_{1} \amp \mathbf{c}_{2} \amp \mathbf{c}_{3} \amp \cdots \amp \mathbf{c}_{n}\\ |\amp |\amp |\amp \amp | \end{array}\right] \tag{10.5.1}\\ \amp = \left[ \begin{array}{ccccc} |\amp |\amp |\amp \amp | \\ \mathbf{q}_{1} \amp \mathbf{q}_{2} \amp \mathbf{q}_{3} \amp \cdots \amp \mathbf{q}_{n}\\ |\amp |\amp |\amp \amp | \end{array}\right] \left[ \begin{array}{ccccc} \norm{ \mathbf{f}_{1} } \amp \mathbf{c}_{2} \cdot \mathbf{q}_{1} \amp \mathbf{c}_{3} \cdot \mathbf{q}_{1} \amp \cdots \amp \mathbf{c}_{n} \cdot \mathbf{q}_{1} \\ 0 \amp \norm{ \mathbf{f}_{2} } \amp \mathbf{c}_{3} \cdot \mathbf{q}_{2} \amp \cdots \amp \mathbf{c}_{n} \cdot \mathbf{q}_{2} \\ 0 \amp 0 \amp \norm{ \mathbf{f}_{3} } \amp \cdots \amp \mathbf{c}_{n} \cdot \mathbf{q}_{3} \\ \vdots \amp \vdots \amp \vdots \amp \ddots \amp \vdots \\ 0 \amp 0 \amp 0 \amp \cdots \amp \norm{ \mathbf{f}_{n} } \end{array} \right]. \notag \end{align}
Here the first factor
\begin{equation*} Q = \left[ \begin{array}{ccccc} |\amp |\amp |\amp \amp | \\ \mathbf{q}_{1} \amp \mathbf{q}_{2} \amp \mathbf{q}_{3} \amp \cdots \amp \mathbf{q}_{n}\\ |\amp |\amp |\amp \amp | \end{array}\right] \end{equation*}
has orthonormal columns, and the second factor is an \(n \times n\) upper triangular matrix \(R\) with positive diagonal entries (and so is invertible). We record this in the following theorem.
The matrices \(Q\) and \(R\) in Theorem 10.5.2 are uniquely determined by \(A\text{;}\) we return to this below.
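Before looking at an example, here is a minimal computational sketch of the construction above (not part of the original text; it assumes Python with NumPy, and the function name is our own).

```python
import numpy as np

def gram_schmidt_qr(A):
    """QR-factorization via the Gram-Schmidt process described above.

    Assumes the columns of A are linearly independent. Returns Q with
    orthonormal columns and an upper triangular R with positive diagonal.
    """
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for k in range(n):
        # Start from c_k and subtract its projections onto q_1, ..., q_{k-1}.
        f = A[:, k].copy()
        for i in range(k):
            R[i, k] = Q[:, i] @ A[:, k]   # entry c_k . q_i of R
            f -= R[i, k] * Q[:, i]
        R[k, k] = np.linalg.norm(f)       # ||f_k|| on the diagonal
        Q[:, k] = f / R[k, k]             # q_k = f_k / ||f_k||
    return Q, R
```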

Example 10.5.3.

Find the QR-factorization of
\begin{equation*} A = \left[ \begin{array}{rrr} 1 \amp 1 \amp 0 \\ -1 \amp 0 \amp 1 \\ 0 \amp 1 \amp 1 \\ 0 \amp 0 \amp 1 \end{array}\right]\text{.} \end{equation*}
Answer.
Denote the columns of \(A\) as \(\mathbf{c}_{1}\text{,}\) \(\mathbf{c}_{2}\text{,}\) and \(\mathbf{c}_{3}\text{,}\) and observe that \(\{\mathbf{c}_{1}, \mathbf{c}_{2}, \mathbf{c}_{3}\}\) is independent. If we apply the Gram-Schmidt algorithm to these columns, the result is:
\begin{equation*} \mathbf{f}_{1} = \mathbf{c}_{1} = \left[ \begin{array}{r} 1 \\ -1 \\ 0 \\ 0 \end{array}\right], \quad \mathbf{f}_{2} = \mathbf{c}_{2} - \frac{1}{2}\mathbf{f}_{1} = \left[ \def\arraystretch{1.2} \begin{array}{r} \frac{1}{2} \\ \frac{1}{2} \\ 1 \\ 0 \end{array}\right], \mbox{ and } \end{equation*}
\begin{equation*} \mathbf{f}_{3} = \mathbf{c}_{3} + \frac{1}{2}\mathbf{f}_{1} - \mathbf{f}_{2} = \left[ \begin{array}{r} 0 \\ 0 \\ 0 \\ 1 \end{array}\right]. \end{equation*}
Write \(\mathbf{q}_{j} = \frac{1}{\norm{ \mathbf{f}_{j} }}\mathbf{f}_{j}\) for each \(j\text{,}\) so \(\{\mathbf{q}_{1}, \mathbf{q}_{2}, \mathbf{q}_{3}\}\) is orthonormal. Then (10.5.1), preceding Theorem 10.5.2, gives \(A = QR\) where
\begin{align*} Q \amp = \left[ \begin{array}{ccc} | \amp | \amp | \\ \mathbf{q}_{1} \amp \mathbf{q}_{2} \amp \mathbf{q}_{3} \\ | \amp | \amp | \end{array}\right] = \left[ \def\arraystretch{1.3}\begin{array}{ccc} \frac{1}{\sqrt{2}} \amp \frac{1}{\sqrt{6}} \amp 0 \\ \frac{-1}{\sqrt{2}} \amp \frac{1}{\sqrt{6}} \amp 0 \\ 0 \amp \frac{2}{\sqrt{6}} \amp 0 \\ 0 \amp 0 \amp 1 \end{array}\right] = \frac{1}{\sqrt{6}} \left[ \begin{array}{ccc} \sqrt{3} \amp 1 \amp 0 \\ -\sqrt{3} \amp 1 \amp 0 \\ 0 \amp 2 \amp 0 \\ 0 \amp 0 \amp \sqrt{6} \end{array} \right] \\ R \amp = \left[ \begin{array}{ccc} \norm{ \mathbf{f}_{1} } \amp \mathbf{c}_{2} \cdot \mathbf{q}_{1} \amp \mathbf{c}_{3} \cdot \mathbf{q}_{1} \\ 0 \amp \norm{ \mathbf{f}_{2} } \amp \mathbf{c}_{3} \cdot \mathbf{q}_{2} \\ 0 \amp 0 \amp \norm{ \mathbf{f}_{3} } \\ \end{array} \right] = \left[ \def\arraystretch{1.5} \begin{array}{ccc} \sqrt{2} \amp \frac{1}{\sqrt{2}} \amp \frac{-1}{\sqrt{2}} \\ 0 \amp \frac{\sqrt{3}}{\sqrt{2}} \amp \frac{\sqrt{3}}{\sqrt{2}} \\ 0 \amp 0 \amp 1 \end{array} \right] = \frac{1}{\sqrt{2}}\left[ \begin{array}{ccc} 2 \amp 1 \amp -1 \\ 0 \amp \sqrt{3} \amp \sqrt{3} \\ 0 \amp 0 \amp \sqrt{2} \end{array} \right] \end{align*}
The reader can verify that indeed \(A = QR\text{.}\)
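As a quick numerical check (a sketch assuming Python with NumPy; not part of the original text), one can confirm that these factors multiply back to \(A\) and that the columns of \(Q\) are orthonormal.

```python
import numpy as np

A = np.array([[1, 1, 0],
              [-1, 0, 1],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)

Q = np.array([[3**0.5, 1, 0],
              [-(3**0.5), 1, 0],
              [0, 2, 0],
              [0, 0, 6**0.5]]) / 6**0.5
R = np.array([[2, 1, -1],
              [0, 3**0.5, 3**0.5],
              [0, 0, 2**0.5]]) / 2**0.5

print(np.allclose(Q @ R, A))             # True: A = QR
print(np.allclose(Q.T @ Q, np.eye(3)))   # True: the columns of Q are orthonormal
```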
If a matrix \(A\) has independent rows and we apply QR-factorization to \(A^{T}\text{,}\) the result is:
Since a square matrix with orthonormal columns is orthogonal, we have:
We now take the time to prove the uniqueness of the QR-factorization.

Proof.

Write
\begin{equation*} Q = \left[ \begin{array}{cccc} |\amp |\amp \amp | \\ \mathbf{c}_{1} \amp \mathbf{c}_{2} \amp \cdots \amp \mathbf{c}_{n}\\ |\amp |\amp \amp | \end{array}\right] \quad \text{and} \quad Q_{1} = \left[ \begin{array}{cccc} |\amp |\amp \amp | \\ \mathbf{d}_{1} \amp \mathbf{d}_{2} \amp \cdots \amp \mathbf{d}_{n}\\ |\amp |\amp \amp | \end{array}\right] \end{equation*}
in terms of their columns, and observe first that \(Q^TQ = I_{n} = Q_{1}^TQ_{1}\) because \(Q\) and \(Q_{1}\) have orthonormal columns. Hence it suffices to show that \(Q_{1} = Q\) (then \(R_{1} = Q_{1}^TA = Q^TA = R\)). Since \(Q_{1}^TQ_{1} = I_{n}\text{,}\) the equation \(QR = Q_{1}R_{1}\) gives \(Q_{1}^TQ = R_{1}R^{-1}\text{;}\) for convenience we write this matrix as
\begin{equation*} Q_{1}^TQ = R_{1}R^{-1} = \left[ \begin{array}{c} t_{ij} \end{array}\right] \end{equation*}
This matrix is upper triangular with positive diagonal elements (since this is true for \(R\) and \(R_{1}\)), so \(t_{ii} \gt 0\) for each \(i\) and \(t_{ij} = 0\) if \(i \gt j\text{.}\) On the other hand, the \((i, j)\)-entry of \(Q_{1}^TQ\) is \(\mathbf{d}_{i}^T\mathbf{c}_{j} = \mathbf{d}_{i} \cdot \mathbf{c}_{j}\text{,}\) so we have \(\mathbf{d}_{i} \cdot \mathbf{c}_{j} = t_{ij}\) for all \(i\) and \(j\text{.}\) But each \(\mathbf{c}_{j}\) is in \(\mbox{span}\{\mathbf{d}_{1}, \mathbf{d}_{2}, \dots, \mathbf{d}_{n}\}\) because \(Q = Q_{1}(R_{1}R^{-1})\text{.}\) We know how to write a vector as a linear combination of an orthonormal basis:
\begin{align*} \mathbf{c}_{j} \amp = (\mathbf{d}_{1} \cdot \mathbf{c}_{j})\mathbf{d}_{1} + (\mathbf{d}_{2} \cdot \mathbf{c}_{j})\mathbf{d}_{2} + \dots + (\mathbf{d}_{n} \cdot \mathbf{c}_{j})\mathbf{d}_{n} \\ \amp = t_{1j}\mathbf{d}_{1} + t_{2j}\mathbf{d}_{2} + \dots + t_{jj}\mathbf{d}_{j}, \end{align*}
because \(\mathbf{d}_{i} \cdot \mathbf{c}_{j} = t_{ij} = 0\) if \(i \gt j\text{.}\) The first few equations here are
\begin{equation*} \begin{array}{ccl} \mathbf{c}_{1} \amp =\amp t_{11}\mathbf{d}_{1} \\ \mathbf{c}_{2} \amp =\amp t_{12}\mathbf{d}_{1} + t_{22}\mathbf{d}_{2} \\ \mathbf{c}_{3} \amp =\amp t_{13}\mathbf{d}_{1} + t_{23}\mathbf{d}_{2} + t_{33}\mathbf{d}_{3} \\ \mathbf{c}_{4} \amp =\amp t_{14}\mathbf{d}_{1} + t_{24}\mathbf{d}_{2} + t_{34}\mathbf{d}_{3} + t_{44}\mathbf{d}_{4} \\ \vdots \amp \amp \vdots \end{array} \end{equation*}
The first of these equations gives
\begin{equation*} 1 = \norm{ \mathbf{c}_{1} } = \norm{ t_{11}\mathbf{d}_{1} } = | t_{11} | \norm{ \mathbf{d}_{1} } = t_{11}, \end{equation*}
whence \(\mathbf{c}_{1} = \mathbf{d}_{1}\text{.}\) But then we have \(t_{12} = \mathbf{d}_{1} \cdot \mathbf{c}_{2} = \mathbf{c}_{1} \cdot \mathbf{c}_{2} = 0\text{,}\) so the second equation becomes \(\mathbf{c}_{2} = t_{22}\mathbf{d}_{2}\text{.}\) Now a similar argument gives \(\mathbf{c}_{2} = \mathbf{d}_{2}\text{,}\) and then \(t_{13} = 0\) and \(t_{23} = 0\) follows in the same way. Hence \(\mathbf{c}_{3} = t_{33}\mathbf{d}_{3}\) and \(\mathbf{c}_{3} = \mathbf{d}_{3}\text{.}\) Continue in this way to get \(\mathbf{c}_{i} = \mathbf{d}_{i}\) for all \(i\text{.}\) This proves that \(Q_{1} = Q\text{.}\)

Subsection 10.5.1 QR-Algorithm for approximating eigenvalues

We learned about an iterative method for computing eigenvalues in the preceding chapter. We also mentioned that a better method for approximating the eigenvalues of an invertible matrix \(A\) depends on the QR-factorization of \(A\text{.}\) While it is beyond the scope of this book to pursue a detailed discussion of this method, we give an example and conclude with some remarks on the QR-algorithm.
The QR-algorithm uses QR-factorization repeatedly to create a sequence of matrices \(A_{1} =A, A_{2}, A_{3}, \dots,\) as follows:
  1. Define \(A_{1} = A\) and factor it as \(A_{1} = Q_{1}R_{1}\text{.}\)
  2. Define \(A_{2} = R_{1}Q_{1}\) and factor it as \(A_{2} = Q_{2}R_{2}\text{.}\)
  3. Define \(A_{3} = R_{2}Q_{2}\) and factor it as \(A_{3} = Q_{3}R_{3}\text{.}\)
  4. \(\displaystyle \vdots\)
In general, \(A_{k}\) is factored as \(A_{k} = Q_{k}R_{k}\) and we define \(A_{k + 1} = R_{k}Q_{k}\text{.}\) Then \(A_{k + 1}\) is similar to \(A_{k}\) [in fact, \(A_{k+1} = R_{k}Q_{k} = (Q_{k}^{-1}A_{k})Q_{k}\)], and hence each \(A_{k}\) has the same eigenvalues as \(A\text{.}\) If the eigenvalues of \(A\) are real and have distinct absolute values, the remarkable thing is that the sequence of matrices \(A_{1}, A_{2}, A_{3}, \dots\) converges to an upper triangular matrix with these eigenvalues on the main diagonal. [See below for the case of complex eigenvalues.]
The example below carries out the QR-algorithm for a \(2\times 2\) matrix.

Example 10.5.7.

If
\begin{equation*} A = \left[ \begin{array}{rr} 1 \amp 1 \\ 2 \amp 0 \end{array}\right], \end{equation*}
use the QR-algorithm to approximate the eigenvalues.
Answer.
The matrices \(A_{1}\text{,}\) \(A_{2}\text{,}\) and \(A_{3}\) are as follows:
\begin{equation*} A_{1} = \left[ \begin{array}{rr} 1 \amp 1 \\ 2 \amp 0 \end{array}\right] = Q_{1}R_{1} \end{equation*}
where
\begin{equation*} Q_{1} = \frac{1}{\sqrt{5}}\left[ \begin{array}{rr} 1 \amp 2 \\ 2 \amp -1 \end{array}\right] \mbox{ and } R_{1} = \frac{1}{\sqrt{5}}\left[ \begin{array}{rr} 5 \amp 1 \\ 0 \amp 2 \end{array}\right] \end{equation*}
In the same vein:
\begin{equation*} A_{2} = \frac{1}{5}\left[ \begin{array}{rr} 7 \amp 9 \\ 4 \amp -2 \end{array}\right] = \left[ \begin{array}{rr} 1.4 \amp 1.8 \\ 0.8 \amp -0.4 \end{array}\right]= Q_{2}R_{2} \end{equation*}
with
\begin{equation*} Q_{2} = \frac{1}{\sqrt{65}}\left[ \begin{array}{rr} 7 \amp 4 \\ 4 \amp -7 \end{array}\right] \mbox{ and } R_{2} = \frac{1}{\sqrt{65}}\left[ \begin{array}{rr} 13 \amp 11 \\ 0 \amp 10 \end{array}\right] \end{equation*}
And lastly,
\begin{equation*} A_{3} = \frac{1}{13}\left[ \begin{array}{rr} 27 \amp -5 \\ 8 \amp -14 \end{array}\right] = \left[ \begin{array}{rr} 2.08 \amp -0.38 \\ 0.62 \amp -1.08 \end{array}\right]. \end{equation*}
This is converging to
\begin{equation*} \left[ \begin{array}{rr} 2 \amp \ast \\ 0 \amp -1 \end{array}\right] \end{equation*}
and so is approximating the eigenvalues \(2\) and \(-1\) on the main diagonal.
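Carrying the iteration further is easy to do with software. Here is a minimal sketch (assuming Python with NumPy; `numpy.linalg.qr` supplies the factorization at each step, and the function name is our own) applied to this same matrix.

```python
import numpy as np

def qr_algorithm(A, iterations=50):
    """Repeatedly factor A_k = Q_k R_k and set A_{k+1} = R_k Q_k.

    For a matrix with real eigenvalues of distinct absolute values, the
    iterates approach an upper triangular matrix whose diagonal entries
    approximate the eigenvalues of A.
    """
    Ak = np.asarray(A, dtype=float)
    for _ in range(iterations):
        Q, R = np.linalg.qr(Ak)
        Ak = R @ Q          # similar to A_k, hence same eigenvalues
    return Ak

A = np.array([[1.0, 1.0],
              [2.0, 0.0]])
print(np.diag(qr_algorithm(A)))   # approximately [ 2., -1.]
```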

Subsection 10.5.2 Least-Squares Approximation

Often an exact solution to a problem in applied mathematics is difficult or impossible to obtain. However, it is usually just as useful to find an approximation to a solution. In particular, finding ``linear approximations" is a powerful technique in applied mathematics. One basic case is the situation where a system of linear equations has no solution, and it is desirable to find a ``best approximation" to a solution to the system.
We begin by defining the ``best approximation’’ in a natural way, and showing that computing the best approximation reduces to solving a related system of linear equations called the normal equations. Next, we demonstrate a common application where a collection of data points is approximated by a curve.
We conclude this section by showing that \(QR\)-factorization provides us with a more efficient way to solve the normal equations and compute the best approximation.

Exploration 10.5.1.

Let
\begin{equation*} A=\begin{bmatrix}3 \amp 1\\1 \amp 2\\1 \amp 2\end{bmatrix}\quad\text{and}\quad \mathbf{b}=\begin{bmatrix}2\\1\\3\end{bmatrix} \end{equation*}
Consider the matrix equation \(A\mathbf{x}=\mathbf{b}\text{.}\) A quick examination of the last two rows should convince you that this equation has no solutions. In other words, \(\mathbf{b}\) is not in the span of the columns of \(A\text{.}\) If \(\mathbf{z}\) were an exact solution to \(A\mathbf{x}=\mathbf{b}\text{,}\) then \(\mathbf{b}-A\mathbf{z}\) would be \(\mathbf{0}\text{.}\)
Since the equation does not have a solution, we will attempt to find the next best thing to a solution by finding \(\mathbf{z}\) such that \(\norm{\mathbf{b}-A\mathbf{z}}\) is as small as possible. The quantity
\begin{equation*} \norm{\mathbf{b}-A\mathbf{z}} \end{equation*}
is called the error. The following GeoGebra interactive will help you understand the geometry behind finding \(\mathbf{z}\text{.}\) RIGHT-CLICK and DRAG to rotate the image for a better view.
Figure 10.5.8.
Record your best guess for \(\mathbf{z}\) -- you will have a chance to check your answer in Example 10.5.10. Now, here are a few questions to keep you on your toes.
Problem 10.5.9.
What did you discover about the geometry of minimizing \(\norm{\mathbf{b}-A\mathbf{z}}\text{?}\) Select all that apply.
  1. \(\mathbf{z}\) is orthogonal to the plane spanned by the columns of \(A\text{.}\)
  2. \(\norm{\mathbf{b}-A\mathbf{z}}\) is orthogonal to \(\text{col}(A)\text{.}\)
  3. \(\mathbf{b}-A\mathbf{z}\) is orthogonal to \(\text{col}(A)\text{.}\)
  4. \(A\mathbf{z}\) is orthogonal to \(\text{col}(A)\text{.}\)
  5. \(A\mathbf{z}\) is an orthogonal projection of \(\mathbf{b}\) onto \(\text{col}(A)\text{.}\)
Answer.
Options 3 and 5.
Our geometric observations will help us develop a method for finding \(\mathbf{z}\text{.}\)
Suppose \(A\) is an \(m\times n\) matrix, and \(\mathbf{b}\) is a column vector in \(\R^m\text{.}\) Consider the matrix equation \(A\mathbf{x}=\mathbf{b}\text{.}\) If this equation does not have a solution, we can attempt to find a best approximation by finding \(\mathbf{z}\) which minimizes the error, \(\norm{\mathbf{b}-A\mathbf{z}}\text{.}\) The expression \(\norm{\mathbf{b}-A\mathbf{z}}\) is also sometimes called the residual.
The error (or the residual) is given in terms of a vector norm. Recall that our definition of the norm involves the sum of squares of the vector components. When we minimize the norm, we minimize the sum of squares. This is why the method we are describing is often referred to as least squares. We will explore this idea further later in this section.
In the case when \(\text{col}(A)\) is a subspace of \(\R^3\text{,}\) we can see geometrically that \(\mathbf{z}\) is the best approximation if and only if \(A\mathbf{z}\) is an orthogonal projection of \(\mathbf{b}\) onto \(\text{col}(A)\text{,}\) and the error is the magnitude of \(\mathbf{b}-A\mathbf{z}\text{,}\) as shown below.
Error vector pictured
What we observed above holds in general. We will use this fact to find \(\mathbf{z}\text{.}\) Every vector in \(\text{col}(A)\) can be written in the form \(A\mathbf{x}\) for some \(\mathbf{x}\) in \(\R^n\text{.}\) Our goal is to find \(\mathbf{z}\) such that \(A\mathbf{z}\) is the orthogonal projection of \(\mathbf{b}\) onto \(\text{col}(A)\text{.}\)
By Corollary 10.1.16, every vector \(A\mathbf{x}\) in \(\text{col}(A)\) is orthogonal to \(\mathbf{b}-A\mathbf{z}\text{.}\) This means \(\mathbf{b}-A\mathbf{z}\) is in the orthogonal complement of \(\text{col}(A)\text{,}\) which is \(\text{null}(A^T)\text{.}\)
Therefore, we have
\begin{align} A^T(\mathbf{b}-A\mathbf{z})\amp = \mathbf{0} \tag{10.5.2}\\ A^T\mathbf{b}-A^TA\mathbf{z}\amp = \mathbf{0} \notag\\ A^TA\mathbf{z}\amp = A^T\mathbf{b} \notag \end{align}
Since \(\mathbf{b}-A\mathbf{z}\) is normal to the subspace \(\text{col}(A)\text{,}\) we call the system in (10.5.2) the normal equations for \(\mathbf{z}\text{.}\) If \(A^TA\) is invertible, then we can write
\begin{equation} \mathbf{z}=(A^TA)^{-1}A^T\mathbf{b}.\tag{10.5.3} \end{equation}
We will return to the question of invertibility of \(A^TA\) in Theorem 10.5.11. For now, let’s revisit the problem posed in Exploration 10.5.1.

Example 10.5.10.

We now return to the matrix equation \(A\mathbf{x}=\mathbf{b}\) of Exploration 10.5.1 to find \(\mathbf{z}\) that best approximates a solution.
Answer.
Recall that
\begin{equation*} A=\begin{bmatrix}3 \amp 1\\1 \amp 2\\1 \amp 2\end{bmatrix}\quad\text{and}\quad \mathbf{b}=\begin{bmatrix}2\\1\\3\end{bmatrix}. \end{equation*}
In this case, \((A^TA)^{-1}\) exists. Applying (10.5.3), we compute
\begin{align*} \mathbf{z}\amp = \left(\begin{bmatrix}3 \amp 1 \amp 1\\1 \amp 2 \amp 2\end{bmatrix}\begin{bmatrix}3 \amp 1\\1 \amp 2\\1\amp 2\end{bmatrix}\right)^{-1}\begin{bmatrix}3 \amp 1 \amp 1\\1 \amp 2 \amp 2\end{bmatrix}\begin{bmatrix}2\\1\\3\end{bmatrix} \\ \amp = \begin{bmatrix}11 \amp 7\\7 \amp 9\end{bmatrix}^{-1}\begin{bmatrix}10\\10\end{bmatrix}=\begin{bmatrix}0.18 \amp -0.14\\-0.14 \amp 0.22\end{bmatrix}\begin{bmatrix}10\\10\end{bmatrix} \\ \amp = \begin{bmatrix}0.4\\0.8\end{bmatrix}. \end{align*}
Compare this answer to your guess in Exploration 10.5.1. If your guess was correct, nice job! If your guess was different, try setting \(\mathbf{z}\) to the correct answer and use the GeoGebra interactive in Exploration 10.5.1 to examine the geometry of the problem.
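If you would like to verify this computation with software, here is a minimal sketch (assuming Python with NumPy; not part of the original text) that solves the normal equations for this \(A\) and \(\mathbf{b}\text{.}\)

```python
import numpy as np

A = np.array([[3, 1],
              [1, 2],
              [1, 2]], dtype=float)
b = np.array([2, 1, 3], dtype=float)

# Solve the normal equations (A^T A) z = A^T b.
z = np.linalg.solve(A.T @ A, A.T @ b)
print(z)   # approximately [0.4, 0.8]

# np.linalg.lstsq minimizes ||b - Az|| directly and gives the same answer.
z_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(z_lstsq)
```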
We now come back to the question of when \(A^TA\) is invertible.

Proof.

Let \(A\) be a matrix with linearly independent columns. We will show that \(\left(A^TA\right)\mathbf{x}=\mathbf{0}\) has only the trivial solution. For \(\mathbf{x}\text{,}\) a solution of \(A^TA\mathbf{x}=\mathbf{0}\text{,}\) we have
\begin{align*} \norm{A\mathbf{x}}^2\amp = (A\mathbf{x})\cdot(A\mathbf{x}) \\ \amp = \left(A\mathbf{x}\right)^TA\mathbf{x} \\ \amp = \mathbf{x}^TA^TA\mathbf{x} \\ \amp = \mathbf{x}^T\cdot\mathbf{0}=0. \end{align*}
Therefore \(A\mathbf{x}=\mathbf{0}\text{.}\) By linear independence of the columns of \(A\) we conclude that \(\mathbf{x}=\mathbf{0}\text{.}\)
We summarize our findings in the following theorem.

Example 10.5.13.

The system of linear equations
\begin{equation*} \begin{matrix}3x \amp - \amp y\amp =\amp 4\\x\amp +\amp 2y\amp =\amp 0\\2x\amp +\amp y\amp =\amp 1\end{matrix} \end{equation*}
has no solution. Find the vector \(\mathbf{z}\) that best approximates a solution.
Answer.
We have
\begin{equation*} A=\begin{bmatrix}3\amp -1\\1\amp 2\\2\amp 1\end{bmatrix},\quad \mathbf{b}=\begin{bmatrix}4\\0\\1\end{bmatrix}. \end{equation*}
The normal equations are \(\left(A^TA\right)\mathbf{z}=A^T\mathbf{b}\text{.}\) We compute
\begin{equation*} \begin{bmatrix}3\amp 1\amp 2\\-1\amp 2\amp 1\end{bmatrix}\begin{bmatrix}3\amp -1\\1\amp 2\\2\amp 1\end{bmatrix}\mathbf{z}=\begin{bmatrix}3\amp 1\amp 2\\-1\amp 2\amp 1\end{bmatrix}\begin{bmatrix}4\\0\\1\end{bmatrix}, \end{equation*}
\begin{equation*} \begin{bmatrix}14\amp 1\\1\amp 6\end{bmatrix}\mathbf{z}=\begin{bmatrix}14\\-3\end{bmatrix}. \end{equation*}
We observe that \(A^TA\) is invertible. Multiplying on the left by \((A^TA)^{-1}\) yields
\begin{equation*} \mathbf{z}=\frac{1}{83}\begin{bmatrix}87\\-56\end{bmatrix}. \end{equation*}
With this vector \(\mathbf{z}\text{,}\) the left sides of the equations become
\begin{equation*} \begin{matrix}3(87/83) \amp - \amp (-56/83)\amp \approx\amp 3.82\\(87/83)\amp +\amp 2(-56/83)\amp \approx\amp -0.30\\2(87/83)\amp +\amp (-56/83)\amp \approx\amp 1.42\end{matrix}. \end{equation*}
This is as close as possible to a solution.

Example 10.5.14.

The average number \(g\) of goals per game scored by a hockey player seems to be related linearly to two factors: the number \(x_1\) of years of experience and the number \(x_2\) of goals in the preceding 10 games.
The following data were collected on four players. Find the linear function \(g=a_0+a_1x_1+a_2x_2\) that best fits the data.
\begin{equation*} \begin{array}{|c|c|c|} \hline g\amp x_1\amp x_2\\ \hline 0.8\amp 5\amp 3\\ 0.8\amp 3\amp 4\\ 0.6 \amp 1\amp 5\\ 0.4 \amp 2\amp 1\\ \hline \end{array} \end{equation*}
Answer.
If the relationship is given by \(g=a_0+a_1x_1+a_2x_2\text{,}\) then the data can be described as follows:
\begin{equation*} \begin{bmatrix}1\amp 5\amp 3\\1\amp 3\amp 4\\1\amp 1\amp 5\\1\amp 2\amp 1\end{bmatrix}\begin{bmatrix}a_0\\a_1\\a_2\end{bmatrix}=\begin{bmatrix}0.8\\0.8\\0.6\\0.4\end{bmatrix}. \end{equation*}
Using Theorem 10.5.12, we get
\begin{equation*} \mathbf{z}=\underbrace{\frac{1}{42}\begin{bmatrix} 119\amp -17\amp -19\\-17\amp 5\amp 1\\-19\amp 1\amp 5 \end{bmatrix}}_{\left(A^TA\right)^{-1}}\underbrace{\begin{bmatrix} 1\amp 1\amp 1\amp 1\\5\amp 3\amp 1\amp 2\\3\amp 4\amp 5\amp 1 \end{bmatrix}}_{A^T}\begin{bmatrix} 0.8\\0.8\\0.6\\0.4 \end{bmatrix}=\begin{bmatrix} 0.14\\0.09\\0.08 \end{bmatrix}. \end{equation*}
Hence the best-fitting function is \(g=0.14+0.09x_1+0.08x_2\text{.}\)
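The same computation can be carried out with software; below is a sketch assuming Python with NumPy (the variable names are ours).

```python
import numpy as np

# Design matrix: a column of ones for a_0, then the x_1 and x_2 values.
A = np.array([[1, 5, 3],
              [1, 3, 4],
              [1, 1, 5],
              [1, 2, 1]], dtype=float)
g = np.array([0.8, 0.8, 0.6, 0.4])

# Least-squares coefficients [a_0, a_1, a_2] from the normal equations.
coeffs = np.linalg.solve(A.T @ A, A.T @ g)
print(np.round(coeffs, 2))   # approximately [0.14, 0.09, 0.08]
```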

Subsection 10.5.3 Application of Least Squares to Curve Fitting

In practice, we would like to fit a function to a set of data points so that the graph of the function passes through each of the points. However, this is sometimes impossible and may not even be desirable (overfitting). In this section, we will learn how to approximate a collection of data points with a line (or a curve) that fits the ``trend" of the points. We will start with data that fit a linear pattern.

Exploration 10.5.2.

Consider the points \((1,1)\text{,}\) \((2, 3)\) and \((4,4)\text{.}\) These points do not lie on a straight line, but they have a general upward linear trend. (Typically there would be many more points to consider, but we will limit our exploration to what we can do by hand.) Our goal is to find a line that fits these points as closely as possible.
We are looking for a function \(f\) of the form \(f(x)=ax+b\) such that the following infeasible system is satisfied as closely as possible
\begin{align*} a(1)\amp + b\amp = 1 \\ a(2)\amp + b\amp = 3 \\ a(4)\amp + b\amp = 4 \end{align*}
From the first part of this section we know how to find a best approximation. By Theorem 10.5.12, we have
\begin{align*} \mathbf{z}=\begin{bmatrix}a\\b\end{bmatrix}\amp = \left(A^TA\right)^{-1}A^T\mathbf{b} \\ \amp = \left(\begin{bmatrix}1\amp 2\amp 4\\1\amp 1\amp 1\end{bmatrix}\begin{bmatrix} 1\amp 1\\2\amp 1\\4\amp 1\end{bmatrix}\right)^{-1}\begin{bmatrix}1\amp 2\amp 4\\1\amp 1\amp 1\end{bmatrix}\begin{bmatrix} 1\\3\\4 \end{bmatrix} \\ \amp = \begin{bmatrix}13/14\\1/2\end{bmatrix}. \end{align*}
According to our computations, the line that best fits the data is given by
\begin{equation*} f(x)=\frac{13}{14}x+\frac{1}{2}. \end{equation*}
Let’s take a look.
Linear regression
We found this fit by minimizing \(\norm{\mathbf{b}-A\mathbf{z}}\text{.}\) We will now investigate the meaning of this expression in relation to the line and the data points.
\begin{equation} \mathbf{b}-A\mathbf{z}=\begin{bmatrix}1\\3\\4\end{bmatrix}-\begin{bmatrix}(13/14)(1)+1/2\\(13/14)(2)+1/2\\{(13/14)(4)+1/2}\end{bmatrix}\approx\begin{bmatrix} -0.43\\0.64\\-0.21 \end{bmatrix}.\tag{10.5.4} \end{equation}
Observe that each entry of \(\mathbf{b}-A\mathbf{z}\) is the signed vertical distance between a particular point and the line. Instead of computing the error, \(\norm{\mathbf{b}-A\mathbf{z}}\text{,}\) we compute \(\norm{\mathbf{b}-A\mathbf{z}}^2\) to avoid the square root.
\begin{equation} \norm{\mathbf{b}-A\mathbf{z}}^2=(-0.43)^2+0.64^2+(-0.21)^2\approx 0.64.\tag{10.5.5} \end{equation}
Minimizing \(\norm{\mathbf{b}-A\mathbf{z}}\) also minimizes \(\norm{\mathbf{b}-A\mathbf{z}}^2\text{.}\) Therefore, what we have minimized is the sum of squares of the vertical distances between the data points and the line. The following GeoGebra interactive will help you explore this idea.
Figure 10.5.15.
In Exploration 10.5.2 we discovered that \(\norm{\mathbf{b}-A\mathbf{z}}^2\) is the sum of squares of vertical distances between the given data points and the proposed line.
By minimizing \(\norm{\mathbf{b}-A\mathbf{z}}\text{,}\) we minimize the sum of squares of vertical distances. This observation holds in general. Given a collection of points
\begin{equation*} (x_1, y_1),\ (x_2, y_2),\dots ,(x_n, y_n), \end{equation*}
to find a linear function of the form \(f(x)=ax+b\) that best fits the points, we find a best approximate solution to the system
\begin{equation*} \begin{bmatrix}x_1\amp 1\\x_2\amp 1\\\vdots\amp \vdots\\x_n\amp 1\end{bmatrix}\begin{bmatrix}a\\b\end{bmatrix}=\begin{bmatrix}y_1\\y_2\\\vdots\\y_n\end{bmatrix} \end{equation*}
by minimizing
\begin{equation*} \norm{\begin{bmatrix}y_1-(ax_1+b)\\y_2-(ax_2+b)\\\vdots\\y_n-(ax_n+b)\end{bmatrix}}^2=(y_1-(ax_1+b))^2+(y_2-(ax_2+b))^2+\dots +(y_n-(ax_n+b))^2 \end{equation*}
A geometric interpretation of \(y_i-(ax_i+b)\) is shown below.
Geometric picture of error
The line we obtain in this fashion is called a line of best fit or a trendline, and the method we used is referred to as the method of least squares. We can apply the method of least squares to find best fitting non-linear functions.
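Before moving to non-linear fits, the recipe above can be packaged as a short routine (a sketch assuming Python with NumPy; the function name is ours).

```python
import numpy as np

def line_of_best_fit(points):
    """Return (a, b) for the trendline f(x) = a*x + b, together with the
    minimized sum of squared vertical distances from the points to the line."""
    xs = np.array([p[0] for p in points], dtype=float)
    ys = np.array([p[1] for p in points], dtype=float)
    A = np.column_stack([xs, np.ones_like(xs)])      # columns x_i and 1
    a, b = np.linalg.solve(A.T @ A, A.T @ ys)        # normal equations
    residual = np.sum((ys - (a * xs + b)) ** 2)
    return a, b, residual

# The points from Exploration 10.5.2:
print(line_of_best_fit([(1, 1), (2, 3), (4, 4)]))
# approximately (0.9286, 0.5, 0.643), i.e. f(x) = (13/14)x + 1/2
```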

Example 10.5.16.

Find the least squares approximating quadratic polynomial of the form \(f(x)=ax^2+bx+c\) for the following points.
\begin{equation*} (-3, 3), \ (-1, 1), \ (0, 1), \ (1, 2), \ (3, 4). \end{equation*}
Answer.
We are looking for an approximate solution to the system of equations
\begin{align*} a(-3)^2\amp + b(-3) + c = 3 \\ a(-1)^2\amp + b(-1) + c = 1 \\ a(0)^2\amp + b(0) + c = 1 \\ a(1)^2\amp + b(1) + c = 2 \\ a(3)^2\amp + b(3) + c = 4 \end{align*}
This corresponds to the matrix equation
\begin{equation*} \begin{bmatrix}9\amp -3\amp 1\\1\amp -1\amp 1\\0\amp 0\amp 1\\1\amp 1\amp 1\\9\amp 3\amp 1\end{bmatrix}\begin{bmatrix}a\\b\\c\end{bmatrix}=\begin{bmatrix}3\\1\\1\\2\\4\end{bmatrix}. \end{equation*}
Multiplying on the left by \(A^T\) gives us the normal equations.
\begin{equation*} A^TA\mathbf{z}=A^T\mathbf{b} \end{equation*}
\begin{equation*} \begin{bmatrix}164\amp 0\amp 20\\0\amp 20\amp 0\\20\amp 0\amp 5\end{bmatrix}\mathbf{z}=\begin{bmatrix}66\\4\\11\end{bmatrix}. \end{equation*}
It turns out that \(A^TA\) is invertible, so it is easy to solve for \(\mathbf{z}\text{.}\) You can use technology to accomplish this; feel free to use any online tool or MATLAB for practice. You arrive at the solution
\begin{equation*} \mathbf{z}=\begin{bmatrix}0.26\\0.20\\1.15\end{bmatrix}. \end{equation*}
Therefore, the quadratic function of best fit is given by \(f(x)=0.26x^2+0.2x+1.15\text{.}\) You can see the graph and the points shown below. Before the end of this section we will return to this problem with a more computationally efficient approach.
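As one possible technology step (a sketch assuming Python with NumPy; not the text's own code), the normal equations for this fit can be set up and solved as follows.

```python
import numpy as np

x = np.array([-3, -1, 0, 1, 3], dtype=float)
y = np.array([3, 1, 1, 2, 4], dtype=float)

# Columns x^2, x, 1 correspond to the unknowns a, b, c.
A = np.column_stack([x**2, x, np.ones_like(x)])
z = np.linalg.solve(A.T @ A, A.T @ y)
print(np.round(z, 2))   # approximately [0.26, 0.2, 1.15]
```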

Example 10.5.17.

Given the data points \((-1, 0)\text{,}\) \((0,1)\text{,}\) and \((1,4)\text{,}\) find the least squares approximating function of the form \(f(x)=ax+b2^x\text{.}\)
Answer.
We are looking for an approximate solution to the system of equations
\begin{align*} a(-1)\amp + b(2^{-1})\amp = 0 \\ a(0)\amp + b(2^{0})\amp = 1 \\ a(1)\amp + b(2^{1})\amp = 4 \end{align*}
This corresponds to the matrix equation
\begin{equation*} \begin{bmatrix}-1\amp 1/2\\0\amp 1\\1\amp 2\end{bmatrix}\begin{bmatrix}a\\b\end{bmatrix}=\begin{bmatrix}0\\1\\4\end{bmatrix}. \end{equation*}
Using the normal equations, we obtain
\begin{equation*} A^TA\mathbf{z}=A^T\mathbf{b}, \end{equation*}
\begin{equation*} \frac{1}{4}\begin{bmatrix}8\amp 6\\6\amp 21\end{bmatrix}\mathbf{z}=\begin{bmatrix}4\\9\end{bmatrix}. \end{equation*}
Solving for \(\mathbf{z}\) yields
\begin{equation*} \mathbf{z}=\begin{bmatrix}10/11\\16/11\end{bmatrix}. \end{equation*}
Therefore, the function of best fit (of the given form) is given by
\begin{equation*} f(x)=\frac{10}{11}x+\frac{16}{11}(2^x). \end{equation*}
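The same normal-equations recipe works for any choice of basis functions; here is a sketch for the basis \(x\) and \(2^x\) (assuming Python with NumPy; not part of the original text).

```python
import numpy as np

x = np.array([-1, 0, 1], dtype=float)
y = np.array([0, 1, 4], dtype=float)

# Columns are the basis functions x and 2^x evaluated at the data points.
A = np.column_stack([x, 2.0**x])
a, b = np.linalg.solve(A.T @ A, A.T @ y)
print(a, b)   # approximately 0.9091 and 1.4545, i.e. 10/11 and 16/11
```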

Subsection 10.5.4 \(QR\)-Factorization: A Quicker Way to do Least Squares

When solving the normal equations in (10.5.2), it is advantageous to have a \(QR\)-factorization of \(A\text{.}\) For then we can write
\begin{align*} A^TA\mathbf{z}\amp = A^T\mathbf{b} \\ (QR)^T(QR)\mathbf{z}\amp = (QR)^T\mathbf{b} \\ R^TQ^TQR\mathbf{z}\amp = R^TQ^T\mathbf{b} \\ R^TR\mathbf{z}\amp = R^TQ^T\mathbf{b}. \end{align*}
Since \(R\) is invertible, \(R^T\) is also invertible, and multiplying on the left by \((R^T)^{-1}\) yields
\begin{equation*} R\mathbf{z} = Q^T\mathbf{b}. \end{equation*}
This last equation is easily solved by back-substitution, since \(R\) is upper triangular. This greatly reduces the amount of computation needed, as we will observe by using Octave in our final example of the section.
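As an illustration of this approach, here is a minimal sketch (assuming Python with NumPy rather than Octave; `numpy.linalg.qr` supplies the factorization, and the back-substitution helper is our own) applied to the quadratic-fit data of Example 10.5.16.

```python
import numpy as np

def back_substitute(R, y):
    """Solve R z = y for an upper triangular, invertible R."""
    n = R.shape[0]
    z = np.zeros(n)
    for i in range(n - 1, -1, -1):
        z[i] = (y[i] - R[i, i + 1:] @ z[i + 1:]) / R[i, i]
    return z

# Data of Example 10.5.16: fit a*x^2 + b*x + c.
x = np.array([-3, -1, 0, 1, 3], dtype=float)
b = np.array([3, 1, 1, 2, 4], dtype=float)
A = np.column_stack([x**2, x, np.ones_like(x)])

Q, R = np.linalg.qr(A)           # reduced QR: Q is 5x3, R is 3x3
z = back_substitute(R, Q.T @ b)  # solve R z = Q^T b
print(np.round(z, 2))            # approximately [0.26, 0.2, 1.15]
```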

Exercises 10.5.5 Exercises

Exercise Group.

In each case find the QR-factorization of \(A\text{.}\)
1.
\begin{equation*} A = \left[ \begin{array}{rr} 1 \amp -1 \\ -1 \amp 0 \end{array}\right]. \end{equation*}
Answer.
\begin{equation*} Q = \frac{1}{\sqrt{2}}\left[ \begin{array}{rr} 1 \amp -1 \\ -1 \amp -1 \end{array}\right], \quad R = \frac{1}{\sqrt{2}}\left[ \begin{array}{rr} 2 \amp -1 \\ 0 \amp 1 \end{array}\right]. \end{equation*}
2.
\begin{equation*} A = \left[ \begin{array}{rr} 2 \amp 1 \\ 1 \amp 1 \end{array}\right]. \end{equation*}
Answer.
\begin{equation*} Q = \frac{1}{\sqrt{5}}\left[ \begin{array}{rr} 2 \amp -1 \\ 1 \amp 2 \end{array}\right], \quad R = \frac{1}{\sqrt{5}}\left[ \begin{array}{rr} 5 \amp 3 \\ 0 \amp 1 \end{array}\right]. \end{equation*}
3.
\begin{equation*} A = \left[ \begin{array}{rrr} 1 \amp 1 \amp 1 \\ 1 \amp 1 \amp 0 \\ 1 \amp 0 \amp 0 \\ 0 \amp 0 \amp 0 \end{array}\right]. \end{equation*}
4.
\begin{equation*} A = \left[ \begin{array}{rrr} 1 \amp 1 \amp 0 \\ -1 \amp 0 \amp 1 \\ 0 \amp 1 \amp 1 \\ 1 \amp -1 \amp 0 \end{array}\right]. \end{equation*}
Answer.
\begin{equation*} Q = \frac{1}{\sqrt{3}}\left[ \begin{array}{rrr} 1 \amp 1 \amp 0 \\ -1 \amp 0 \amp 1 \\ 0 \amp 1 \amp 1 \\ 1 \amp -1 \amp 1 \end{array}\right], \quad R = \frac{1}{\sqrt{3}}\left[ \begin{array}{rrr} 3 \amp 0 \amp -1 \\ 0 \amp 3 \amp 1 \\ 0 \amp 0 \amp 2 \end{array}\right]. \end{equation*}

5.

If \(R\) is upper triangular and invertible, show that there exists a diagonal matrix \(D\) with diagonal entries \(\pm 1\) such that \(R_{1} = DR\) is invertible, upper triangular, and has positive diagonal entries.

6.

If \(A\) has independent columns, let \(A = QR\) where \(Q\) has orthonormal columns and \(R\) is invertible and upper triangular. (Some authors do not require a \(QR\)-factorization to have positive diagonal entries.) Show that there is a diagonal matrix \(D\) with diagonal entries \(\pm 1\) such that \(A = (QD)(DR)\) is the QR-factorization of \(A\text{.}\)

Exercise Group.

In each case, find the exact eigenvalues and then approximate them using the QR-algorithm.
7.
\begin{equation*} A = \left[ \begin{array}{rr} 1 \amp 1 \\ 1 \amp 0 \end{array}\right]. \end{equation*}
Answer.
The exact eigenvalues are \(\frac{1 \pm \sqrt{5}}{2}\text{.}\) The first step of the QR-algorithm gives
\begin{equation*} A_{1} = \left[ \begin{array}{rr} 1 \amp 1 \\ 1 \amp 0 \end{array}\right] = Q_{1}R_{1}, \quad Q_{1} = \frac{1}{\sqrt{2}}\left[ \begin{array}{rr} 1 \amp 1 \\ 1 \amp -1 \end{array}\right], \quad R_{1} = \frac{1}{\sqrt{2}}\left[ \begin{array}{rr} 2 \amp 1 \\ 0 \amp 1 \end{array}\right]. \end{equation*}
8.
\begin{equation*} A = \left[ \begin{array}{rr} 3 \amp 1 \\ 1 \amp 0 \end{array}\right]. \end{equation*}
Answer.
The exact eigenvalues are \(\frac{3 \pm \sqrt{13}}{2}\text{.}\) Here
\begin{equation*} A_{1} = \left[ \begin{array}{rr} 3 \amp 1 \\ 1 \amp 0 \end{array}\right], \quad Q_{1} = \frac{1}{\sqrt{10}}\left[ \begin{array}{rr} 3 \amp -1 \\ 1 \amp 3 \end{array}\right] \end{equation*}
and
\begin{equation*} R_{1} = \frac{1}{\sqrt{10}}\left[ \begin{array}{rr} 10 \amp 3 \\ 0 \amp -1 \end{array}\right], \end{equation*}
so that
\begin{equation*} A_{2} = \frac{1}{10}\left[ \begin{array}{rr} 33 \amp -1 \\ -1 \amp -3 \end{array}\right], \quad Q_{2} = \frac{1}{\sqrt{1090}}\left[ \begin{array}{rr} 33 \amp 1 \\ -1 \amp 33 \end{array}\right], \quad R_{2} = \frac{1}{\sqrt{1090}}\left[ \begin{array}{rr} 109 \amp -3 \\ 0 \amp -10 \end{array}\right]. \end{equation*}

9.

If \(A\) is symmetric, show that each matrix \(A_{k}\) in the QR-algorithm is also symmetric. Deduce that they converge to a diagonal matrix.
Answer.
Use induction on \(k\text{.}\) If \(k = 1\text{,}\) \(A_{1} = A\text{.}\) In general \(A_{k+1} = Q_{k}^{-1}A_{k}Q_{k} = Q_{k}^{T}A_{k}Q_{k}\text{,}\) so the fact that \(A_{k}^{T} = A_{k}\) implies \(A_{k+1}^{T} = A_{k+1}\text{.}\) The eigenvalues of \(A\) are all real, so the \(A_{k}\) converge to an upper triangular matrix \(T\text{.}\) But \(T\) must also be symmetric (it is the limit of symmetric matrices), so it is diagonal.

10.

Apply the QR-algorithm to
\begin{equation*} A = \left[ \begin{array}{rr} 2 \amp -3 \\ 1 \amp -2 \end{array}\right]. \end{equation*}
Explain.

11.

Given a matrix \(A\text{,}\) let \(A_{k}\text{,}\) \(Q_{k}\text{,}\) and \(R_{k}\text{,}\) \(k \geq 1\text{,}\) be the matrices constructed in the QR-algorithm. Show that \(A_{k} = (Q_{1}Q_{2} \cdots Q_{k})(R_{k} \cdots R_{2}R_{1})\) for each \(k \geq 1\) and hence that this is a QR-factorization of \(A_{k}\text{.}\)
Hint.
Show that \(Q_{k}R_{k} = R_{k-1}Q_{k-1}\) for each \(k \geq 2\text{,}\) and use this equality to compute \((Q_{1}Q_{2} \cdots Q_{k})(R_{k} \cdots R_{2}R_{1})\) ``from the centre out.’’ Use the fact that \((AB)^{n+1} = A(BA)^{n}B\) for any square matrices \(A\) and \(B\text{.}\)

12.

Find the best approximation to a solution to the system of equations.
\begin{equation*} \begin{matrix}3x\amp +\amp y\amp +\amp z\amp =\amp 6\\2x\amp +\amp 3y\amp -\amp z\amp =\amp 1\\2x\amp -\amp y\amp +\amp z\amp =\amp 0\\3x\amp -\amp 3y\amp +\amp 3z\amp =\amp 8\end{matrix} \end{equation*}
Enter answers in fraction form below.
Answer.
\begin{equation*} x=\frac{-20}{12},\quad y=\frac{46}{12},\quad z=\frac{95}{12}. \end{equation*}

Exercise Group.

Find a linear function of best fit for each of the following sets of data points. Examine how well your line fits the points by typing the equation of the line into the Desmos window.
13.
\begin{equation*} (2,4), \ (4,3), \ (7,2), \ (8,1). \end{equation*}
Enter your answers in fraction form.
Answer.
\begin{equation*} f(x)=\frac{-6}{13}x+\frac{64}{13}. \end{equation*}
14.
\begin{equation*} (-2, 3), \ (-1,1), \ (0,0), \ (1, -2), \ (2, -4). \end{equation*}
Answer.
\begin{equation*} f(x)=-1.7x-0.4. \end{equation*}

15.

Use MATLAB or another program/tool to find the least squares approximating quadratic function for the following data points.
\begin{equation*} (-2,1), \ (0,0), \ (3,2), \ (4,3). \end{equation*}
Round your answers to three decimal places.
Answer.
\begin{equation*} f(x)=0.194x^2-0.024x+0.127. \end{equation*}

16.

If \(A\) is an \(m \times n\) matrix, it can be proved that there exists a unique \(n \times m\) matrix \(A^{\#}\) satisfying the following four conditions: \(AA^{\#}A = A\text{;}\) \(A^{\#}AA^{\#} = A^{\#}\text{;}\) \(AA^{\#}\) is symmetric; \(A^{\#}A\) is symmetric. The matrix \(A^{\#}\) is called the Moore-Penrose inverse.
  1. If \(A\) is square and invertible, show that \(A^{\#} = A^{-1}\text{.}\)
  2. If \(\text{rank} A = m\text{,}\) show that \(A^{\#} = A^{T}(AA^{T})^{-1}\text{.}\)
  3. If \(\text{rank} A = n\text{,}\) show that \(A^{\#} = (A^{T}A)^{-1}A^{T}\text{.}\) (Notice that the Moore-Penrose inverse appeared when we solved the normal equations, arriving at Equation (10.5.3).)