Linear Regression

NoteQuestion

What is a linear regression model?

Motivation

In this part, we will try to solve the regression problem that we introduced in the first part. Let us formally define the problem:

NoteProblem

Predict the selling price of a house given the data of \(\displaystyle n\) houses, where each house is represented by \(\displaystyle d\) features.

Consider two houses, one which has an area of \(1000\) square feet and the other which has an area of \(2000\) square feet. As the area of the house increases, the selling price is going to go up. Take another feature, say the distance from the nearest school. If the distance increases, then the selling price goes down. Perhaps the effect may not be as drastic. If we have access to only these two features, one function or model could be as follows:

\[ \begin{equation*} \text{Selling-price} =2\times \text{Area} -0.2\times \text{Distance} +\text{Constant} \end{equation*} \]

This is what is called a linear model. The values \(2\) and \(-0.2\) are called the coefficients or weights. The magnitude of a weight denotes the importance of the corresponding feature. Its sign denotes the effect it has on the output. For example, the distance feature is negatively correlated with the selling-price, but it is not as important as the area.

NoteRemark

We might be wrong about the choice of weights. Even worse, the relationship between the selling price and the two features may not even be linear! It is important to understand that a model is some approximation of the underlying reality. There is a nice quote that summarizes this idea:

All models are wrong, but some are useful.

Linear model

Generalizing this, let us say that we have a feature vector \(\mathbf{x}\) and a vector of weights, called the weight vector \(\mathbf{w}\). Recall that the housing data has six features:

\[ \begin{equation*} \mathbf{x} =\begin{bmatrix} x_{1}\\ x_{2}\\ x_{3}\\ x_{4}\\ x_{5}\\ x_{6} \end{bmatrix} ,\ \mathbf{w} =\begin{bmatrix} w_{1}\\ w_{2}\\ w_{3}\\ w_{4}\\ w_{5}\\ w_{6} \end{bmatrix} \end{equation*} \]

The function or model that maps a data-point to the predicted label \(\hat{y}\) is:

\[ \begin{equation*} \hat{y}=x_{1} w_{1} +x_{2} w_{2} +x_{3} w_{3} +x_{4} w_{4} +x_{5} w_{5} +x_{6} w_{6} +\text{constant} \end{equation*} \]

We can rewrite the constant as one more weight, say \(w_{0}\):

\[ \begin{equation*} \hat{y}=x_{1} w_{1} +x_{2} w_{2} +x_{3} w_{3} +x_{4} w_{4} +x_{5} w_{5} +x_{6} w_{6} +w_{0} \end{equation*} \]

Going back to the vectors, we add a feature \(1\) to the feature vector and a \(w_{0}\) to the weights:

\[ \begin{equation*} \mathbf{x} =\begin{bmatrix} 1\\ x_{1}\\ x_{2}\\ x_{3}\\ x_{4}\\ x_{5}\\ x_{6} \end{bmatrix} ,\mathbf{w} =\begin{bmatrix} w_{0}\\ w_{1}\\ w_{2}\\ w_{3}\\ w_{4}\\ w_{5}\\ w_{6} \end{bmatrix} \end{equation*} \]

If we now look at the expression for \(\hat{y}\), it is nothing but the dot-product of the two vectors:

\[ \begin{equation*} \begin{aligned} \hat{y} & =1\cdot w_{0} +x_{1} w_{1} +x_{2} w_{2} +x_{3} w_{3} +x_{4} w_{4} +x_{5} w_{5} +x_{6} w_{6}\\ & \\ & =\mathbf{w}^{T}\mathbf{x} \end{aligned} \end{equation*} \]

\(\mathbf{w}^{T}\mathbf{x}\) is the dot product between the two vectors \(\displaystyle \mathbf{w}\) and \(\mathbf{x}\). This is the notation that we will be using for the dot product from now on. This is the same as the matrix-product of a row-vector and a column-vector:

\[ \begin{equation*} \hat{y}=\mathbf{w}^{T}\mathbf{x} =\begin{bmatrix} w_{0} & w_{1} & w_{2} & w_{3} & w_{4} & w_{5} & w_{6} \end{bmatrix}\begin{bmatrix} 1\\ x_{1}\\ x_{2}\\ x_{3}\\ x_{4}\\ x_{5}\\ x_{6} \end{bmatrix} \end{equation*} \]

The linear model is therefore expressed as:

\[ \begin{equation*} f(\mathbf{x}) =\mathbf{w}^{T}\mathbf{x} \end{equation*} \]

If \(\mathbf{x}\) is the feature vector of a house, then \(f(\mathbf{x} )\) will be its predicted selling price. We call \(\displaystyle \mathbf{w}\) the weight vector of the model. The elements of \(\displaystyle \mathbf{w}\) could also be called the parameters of the model.

Now that we understand a linear model, we can define the family of linear models:

\[ \begin{equation*} \mathcal{H} =\left\{f:f( x) =\mathbf{w}^{T}\mathbf{x} ,\mathbf{w} \in \mathbb{R}^{d}\right\} \end{equation*} \]

Learning problem

So much for one house. But we have \(n\) houses. We can write down \(\displaystyle n\) linear equations, one for each house. For the sake of brevity, we will just list down the equation for the \(\displaystyle i^{th}\) house.

\[ \begin{equation*} \begin{aligned} \hat{y}_{i} = w_{0} +x_{i1} \cdot w_{1} +x_{i2} \cdot w_{2} +x_{i3} \cdot w_{3} +x_{i4} \cdot w_{4} +x_{i5} \cdot w_{5} +x_{i6} \cdot w_{6} & \end{aligned} \end{equation*} \]

These \(\displaystyle n\) equations can be given a neat matrix representation:

\[ \begin{equation*} \begin{bmatrix} \hat{y}_{1}\\ \vdots \\ \hat{y}_{n} \end{bmatrix} = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & x_{1,3} & x_{1,4} & x_{1,5} & x_{1,6}\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x_{n,1} & x_{n,2} & x_{n,3} & x_{n,4} & x_{n,5} & x_{n,6} \end{bmatrix}\begin{bmatrix} w_{0}\\ w_{1}\\ w_{2}\\ w_{3}\\ w_{4}\\ w_{5}\\ w_{6} \end{bmatrix} \end{equation*} \]

Bringing in the data-matrix \(\displaystyle \mathbf{X}\) and the predicted label vector \(\displaystyle \hat{\mathbf{y}}\), we get the following compact equation:

\[ \begin{equation*} \mathbf{\hat{y}} =\mathbf{Xw} \end{equation*} \]

Recall that the loss function, SSE, measures the closeness between the predicted label and the true label. Denoting this by \(\displaystyle L(\hat{\mathbf{y}} ,\mathbf{y})\):

\[ \begin{equation*} L(\hat{\mathbf{y}} ,\mathbf{y}) =(\hat{y}_{1} -y_{1})^{2} +\cdots +(\hat{y}_{n} -y_{n})^{2} \end{equation*} \]

More compactly:

\[ \begin{equation*} L(\hat{\mathbf{y}} ,\mathbf{y}) =(\mathbf{\hat{y}} -\mathbf{y})^{T}(\mathbf{\hat{y}} -\mathbf{y}) \end{equation*} \]

Though we have expressed the loss function above as a function of the predicted and true label vectors, it is more appropriate to represent it as a function of the weight vector \(\displaystyle \mathbf{w}\). To see this, note that in a machine learning context, the dataset \(\displaystyle (\mathbf{X} ,\mathbf{y})\) is our input and we are in search of the function \(\displaystyle f( x) =\mathbf{w}^{T}\mathbf{x}\) that is parameterized by \(\displaystyle \mathbf{w}\). So \(\displaystyle \mathbf{w}\) is the unknown here. Armed with this knowledge, we can rewrite the loss function as:

\[ \begin{equation*} L(\mathbf{w}) =(\mathbf{Xw} -\mathbf{y})^{T}(\mathbf{Xw} -\mathbf{y}) \end{equation*} \]

The practical goal in front of us is to minimize the loss function:

\[ \begin{equation*} \underset{\mathbf{w}}{\min} \ L(\mathbf{w}) \end{equation*} \]

Denoting \(\displaystyle \mathbf{w}^{*}\) as the weight vector that minimizes this loss function, we can express this as:

\[ \begin{equation*} \mathbf{w}^{*} =\underset{\mathbf{w}}{\arg\min} \ \ L(\mathbf{w}) \end{equation*} \]

Warning

Recall that \(\displaystyle (\mathbf{X} ,\mathbf{y})\) is the training dataset. Never forget that minimizing the loss on the training dataset without any consideration on what would happen to the performance on the test dataset is not really the correct approach.

We will pause here and pursue the solution to this problem in a later section once we have understood some of the essential ideas in linear algebra.

Conclusion

And brings us to the end of a brief introduction to machine learning. The aim of this introductory section was to give some motivation as to why we are interested in the study of linear algebra in a course on data-science. It has perhaps also convinced us of the need to study optimization. Probability has been conspicuous by its absence. But we will return to that later.

Moving forward, we will courageously embrace the abstractions of linear algebra without making any references to ML. The connecting links between ML and the math behind ML is not possible at all times. Whenever we feel overwhelmed by the abstractions of linear algebra, it is sufficient to take a step back and think about a matrix as some dataset and matrix algebra as transformations of this dataset. Though this is not always the case, this mindset provides some sort of an anchor when we get lost.