\(\mathbf{Ax} \approx \mathbf{b}\)
How do we solve for \(\mathbf{x}\) in the equation \(\mathbf{Ax} \approx \mathbf{b}\)?
Approximation
So far we have seen the well behaved case of \(\mathbf{Ax} =\mathbf{b}\). What if \(\mathbf{b}\) is not in the column space of \(\mathbf{A}\)? Then \(\mathbf{Ax} \neq \mathbf{b}\). This is a typical scenario observed in engineering and the sciences. We don’t abandon the problem because it doesn’t admit an exact solution. Instead, we turn to a powerful weapon in our armory: approximations. Can we find a \(\mathbf{\widehat{x}}\) such that \(\mathbf{A}\widehat{\mathbf{x}} \approx \mathbf{b}\)? What do we mean by the symbol \(\approx\) here? We are dealing with two vectors on either side of the symbol. Let us first understand what the symbol means when we have scalars.
Let us look at the following statements:
- \(1.234\approx 1\)
- \(1.234\approx 1.2\)
- \(1.234\approx 1.23\)
These are three approximations. From experience, we know that the second approximation is better than the first and the third better than the second. If we plot these points on a real line, the goodness of an approximation can be interpreted using the distance between the original point and its approximation: smaller the distance, better the approximation.
This very idea is extended to an \(n\) dimensional vector space. If \(\mathbf{b}\) and \(\mathbf{\widehat{b}}\) are two vectors in \(\mathbb{R}^{m}\), then the Euclidean distance between them is a measure of the goodness of the approximation. To avoid taking square roots, we write down the expression for the squared distance:
\[ \begin{equation*} ||\widehat{\mathbf{b}} -\mathbf{b} ||^{2} =(\widehat{b}_{1} -b_{1} )^{2} +\cdots +(\widehat{b}_{m} -b_{m} )^{2} \end{equation*} \]
The key to solving the problem \(\mathbf{Ax} \approx \mathbf{b}\), is to find a vector \(\widehat{\mathbf{x}}\) such that \(\mathbf{A}\widehat{\mathbf{x}}\) is as close to \(\mathbf{b}\) as possible. This can be framed as an optimization problem:
\[ \boxed{ \widehat{\mathbf{x}} =\underset{\mathbf{x}}{\arg\min} \ \ ||\mathbf{Ax} -\mathbf{b} ||^{2} } \]
If you are seeing \(\arg\min\) for the first time, think about it like a function (in the programming sense):
- Find the value that minimizes the expression
- Return this value to the user
Let us remind ourselves about the dimensions once again:
- \(\mathbf{A}\) is a matrix of shape \(m \times n\)
- \(\displaystyle \mathbf{x} \in \mathbb{R}^{n}\)
- \(\displaystyle \mathbf{Ax} \in \mathbb{R}^{m}\)
- \(\displaystyle \mathbf{b} \in \mathbb{R}^{m}\)
Since both \(\mathbf{Ax}\) and \(\mathbf{b}\) belong to the same space \(\mathbb{R}^{m}\), we are justified in comparing them.
Objective Function
The function to be minimized is called the objective function:
\[ L(\mathbf{x}) =||\mathbf{Ax} -\mathbf{b} ||^{2} \]
Since \(||\mathbf{v}||^2 = \mathbf{v}^T \mathbf{v}\) for a vector \(\mathbf{v}\), we can also express \(L(\mathbf{x})\) as:
\[ L(\mathbf{x}) =(\mathbf{Ax} -\mathbf{b})^{T}(\mathbf{Ax} -\mathbf{b}) \]
For solving this problem it would be convenient to represent \(\displaystyle A\) as a collection of \(\displaystyle m\) rows. Representing the \(\displaystyle i^{\text{th}}\) row by the row vector \(\displaystyle \mathbf{a}_{i}^{T}\):
\[ \mathbf{A} =\begin{bmatrix} - & \mathbf{a}_{1}^{T} & -\\ & \vdots & \\ - & \mathbf{a}_{m}^{T} & - \end{bmatrix} \]
Therefore, \(\displaystyle \mathbf{Ax}\) would be the vector:
\[ \mathbf{Ax} = \begin{bmatrix} \mathbf{a}_{1}^{T}\mathbf{x}\\ \vdots \\ \mathbf{a}_{m}^{T}\mathbf{x} \end{bmatrix} \]
With this \(\displaystyle \mathbf{Ax} -\mathbf{b}\) becomes:
\[ \mathbf{Ax} -\mathbf{b} =\begin{bmatrix} \mathbf{a}_{1}^{T}\mathbf{x} -b_{1}\\ \vdots \\ \mathbf{a}_{m}^{T}\mathbf{x} -b_{m} \end{bmatrix} \]
With that:
\[ ||\mathbf{Ax} -\mathbf{b} ||^{2} =\sum\limits _{i=1}^{m}\left(\mathbf{a}_{i}^{T}\mathbf{x} -b_{i}\right)^{2} \]
So far we have seen three different ways of expressing the objective function. We will freely move across these forms in upcoming sections.
\[ \begin{aligned} L(\mathbf{x}) & =||\mathbf{Ax} -\mathbf{b} ||^{2}\\ & \\ & =(\mathbf{Ax} -\mathbf{b})^{T}(\mathbf{Ax} -\mathbf{b})\\ & \\ & =\sum\limits _{i=1}^{m}\left(\mathbf{a}_{i}^{T}\mathbf{x} -b_{i}\right)^{2} \end{aligned} \]
Least Squares and Linear Regression
Have we seen this problem anywhere before? This was exactly the problem that we were concerned with in linear regression! The variables used were a bit different. So let us spend a moment to reacquaint ourselves with linear regression. Recall that in linear regression, the optimization problem was:
\[ \underset{\mathbf{w}}{\min} \ (\mathbf{Xw} -\mathbf{y})^{T}(\mathbf{Xw} -\mathbf{y}) \]
Each row of \(\displaystyle \mathbf{X}\) has the features of a data-point. \(\displaystyle \mathbf{w}\) is the vector of weights and \(\displaystyle \mathbf{y}\) has the true labels. Let us put this side by side with the problem we have been discussing so far:
\[ \begin{aligned} \underset{\mathbf{w}}{\min} \ (\mathbf{Xw} -\mathbf{y})^{T}(\mathbf{Xw} -\mathbf{y})\\ \\\\ \underset{\mathbf{x}}{\min} \ (\mathbf{Ax} -\mathbf{b})^{T}(\mathbf{Ax} -\mathbf{b}) \end{aligned} \]
And they are one and the same if we make the following substitution!
\[ \mathbf{A}\rightarrow \mathbf{X} ,\ \mathbf{x}\rightarrow \mathbf{w} ,\ \mathbf{b}\rightarrow \mathbf{y} \]
Let us also take a moment to compare alternative forms of the objective function:
\[ \begin{aligned} L(\mathbf{w}) =\sum\limits _{i=1}^{m}\left(\mathbf{x}_{i}^{T}\mathbf{w} -y_{i}\right)^{2}\\ \\\\ L(\mathbf{x}) =\sum\limits _{i=1}^{m}\left(\mathbf{a}_{i}^{T}\mathbf{x} -b_{i}\right)^{2} \end{aligned} \]
Proceeding with this form for the moment, let us focus on the linear regression objective. Representing \(\displaystyle e_{i} =\mathbf{x}_{i}^{T}\mathbf{w} -y_{i}\), we see that \(\displaystyle e_{i}\) is the error in prediction. So, \(\displaystyle L(\mathbf{w})\) is also called the sum of squared errors:
\[ L(\mathbf{w}) =\sum\limits _{i=1}^{m} e_{i}^{2} \]
This is called the least squares problem. The reasons are quite clear:
least, since it’s a minimization problem
squares, since we are minimizing the sum of the squared deviations
What seemed like a wild excursion into the world of a system of linear equations is in fact inextricably linked with the world of supervised learning. That is enough motivation to seek for a solution to the problem.
Normal Equations
Let us go back to the \(\displaystyle (\mathbf{A} ,\mathbf{x} ,\mathbf{b})\) notation and stick with this. Given that we have to minimize \(\displaystyle L(\mathbf{x})\) with respect to \(\displaystyle \mathbf{x}\), the first thing that we can try is to find the gradient of \(\displaystyle L\) with respect to \(\displaystyle \mathbf{x}\) and set it to zero. Recall that the gradient of a function is the vector of partial derivatives. Denoting it by \(\displaystyle \nabla L(\mathbf{x}) \in \mathbb{R}^{n}\), we have:
\[ \nabla L(\mathbf{x}) =\begin{bmatrix} \cfrac{\partial L(\mathbf{x})}{\partial x_{1}}\\ \vdots \\ \cfrac{\partial L(\mathbf{x})}{\partial x_{n}} \end{bmatrix} \]
To compute the partial derivatives, it would be convenient to use the following form of the objective:
\[ \begin{aligned} L(\mathbf{x}) & =\sum\limits _{i=1}^{m}\left(\mathbf{a}_{i}^{T}\mathbf{x} -b_{i}\right)^{2} \end{aligned} \]
Let us first take one term in the summation without the square:
\[ \mathbf{a}_{i}^{T}\mathbf{x} -b_{i} \]
The gradient of this term with respect to \(\displaystyle \mathbf{x}\) is just the vector \(\displaystyle \mathbf{a}_{i}\):
\[ \nabla \left(\mathbf{a}_{i}^{T}\mathbf{x} -b_{i}\right) =\mathbf{a}_{i} \]
Using chain rule, we can express the gradient of the squared term as:
\[ \nabla \left[\left(\mathbf{a}_{i}^{T}\mathbf{x} -b_{i}\right)^{2}\right] =2\cdot \left(\mathbf{a}_{i}^{T}\mathbf{x} -b_{i}\right) \cdot \mathbf{a}_{i} \]
With that, the complete gradient is:
\[ \nabla L(\mathbf{x}) =2\cdot \sum\limits _{i=1}^{m}\left(\mathbf{a}_{i}^{T}\mathbf{x} -b_{i}\right) \cdot \mathbf{a}_{i} \]
A slight rearranging of this can be helpful:
\[ \begin{aligned} \nabla L( x) & =2\cdot \sum\limits _{i=1}^{m}\left(\mathbf{a}_{i}^{T}\mathbf{x} -b_{i}\right) \cdot \mathbf{a}_{i}\\ & \\ & =2\cdot \sum\limits _{i=1}^{m}\left[\left(\mathbf{a}_{i}\mathbf{a}_{i}^{T}\right)\mathbf{x} -b_{i} \cdot \mathbf{a}_{i}\right]\\ & \\ & =2\cdot \left[\left(\sum\limits _{i=1}^{m}\mathbf{a}_{i}\mathbf{a}_{i}^{T}\right) \mathbf{x}-\sum\limits _{i=1}^{m} b_{i} \cdot a_{i}\right] \end{aligned} \]
If the rows of \(\displaystyle \mathbf{A}\) are the row vectors \(\displaystyle \mathbf{a}_{i}^{T}\) then the columns of \(\displaystyle \mathbf{A}^{T}\) would have the column vectors \(\displaystyle \mathbf{a}_{i}\). Using this observation, we have:
\[ \begin{aligned} \sum\limits _{i=1}^{n} b_{i} \cdot \mathbf{a}_{i} & =\mathbf{A}^{T}\mathbf{b}\\ & \\ \sum\limits _{i=1}^{n}\mathbf{a}_{i}\mathbf{a}_{i}^{T} & =\mathbf{A}^{T}\mathbf{A} \end{aligned} \]
Refer to the appendix on matrix multiplication to understand why these relations are valid if they aren’t immediately apparent. Plugging these into the equation for the gradient we have:
\[ \nabla L( \mathbf{x}) =2\cdot \left[\left(\mathbf{A}^{T}\mathbf{A}\right)\mathbf{x} -\mathbf{A}^{T}\mathbf{b}\right] \]
Setting the gradient to zero:
\[ \boxed{\left(\mathbf{A}^{T}\mathbf{A}\right)\mathbf{x} = \mathbf{A}^{T}\mathbf{b} } \]
This system is called the normal equations. Let us take a moment to consider the shapes of the objects involved:
\(\displaystyle \mathbf{A}^{T}\mathbf{A}\) has shape \(\displaystyle n\times n\)
\(\displaystyle \mathbf{x} \in \mathbb{R}^{n}\)
\(\displaystyle \mathbf{A}^{T}\mathbf{b} \in \mathbb{R}^{n}\)
This is a system of \(\displaystyle n\) linear equations in \(\displaystyle n\) unkowns. If the matrix \(\mathbf{A}^{T}\mathbf{A}\) is invertible, then we have the following solution:
\[ \begin{equation*} \widehat{\mathbf{x}} =(\mathbf{A}^{T}\mathbf{A} )^{-1}\mathbf{A}^{T}\mathbf{b} \end{equation*} \]
We still don’t know if this corresponds to a minimum. Rest assured that this is indeed a minimum. But we won’t take up the proof now. The other worrying part is what happens if the matrix \(\mathbf{A}^{T}\mathbf{A}\) is singular or non-invertible. It turns out that even in this situation, this system is always consistent. The proof of consistency is a bit involved. It has been moved to the appendix on the matrix \(\mathbf{A}^T \mathbf{A}\).
The final point concerns the nature of solutions to this system. One can show that the set of all solutions to the least squares system is given by:
\[ S=\mathbf{x}_{p} +\mathcal{N}(\mathbf{A}) \]
Here, \(\displaystyle \mathbf{x}_{p}\) is a particular solution. The nullspace never leaves us! From this, we can immediately infer that the least squares solution is unique if \(\displaystyle \mathcal{N}(\mathbf{A})\) is the trivial subspace \(\displaystyle \{\mathbf{0}\}\), which happens when \(\displaystyle \mathbf{A}\) has linearly independent columns.
Summary
When \(\mathbf{b}\) is not in the column space of \(\mathbf{A}\), we look for an approximate solution. One way to specify a good approximation is to find an \(\widehat{\mathbf{x}}\) that minimizes the following objective function:
\[ \begin{equation*} L(\mathbf{x}) = || \mathbf{Ax} - \mathbf{b}||^2 \end{equation*} \]
This \(\mathbf{\widehat{x}}\) is given by the solution to the normal equations:
\[ \begin{equation*} (\mathbf{A}^{T}\mathbf{A} )\widehat{\mathbf{x}} =\mathbf{A}^{T}\mathbf{b} \end{equation*} \]
This system always has a solution. In the next few units, we shall try to understand the least squares problem from the lens of geometry.