ML Problem

Question

What does a typical ML problem look like?

Analogy

Let us continue with the analogy of three-digit addition. Earlier, we were concerned with testing a machine’s performance. Now, we will turn to the machine itself and see what it means to learn something. Think about arithmetic classes in primary school. During the class hours, a student looks at solved examples in a textbook and learns how to solve simple three digit addition problems. Let us say that her textbook has the following problems along with the answers:

\(103+205=308\)
\(123+409=532\)
\(185+483=668\)

During the instructional hours, the student has access to both the questions and the answers. In the exam, she will not have access to the answers. But more importantly, she will not even be asked the same questions! So, just memorizing the answers will not help. She would have to learn how addition works. She needs to have a mental model of addition. In other words, she would have to learn a function from the input (question) to the output (answer).

This is exactly what happens in a typical ML problem, regression in our case. The input is a data-point, the vector of features describing a house. The output is a real number, the selling price of the house, which is the called target or label. The machine has to learn the mapping from input to output. Once this mapping or function is learnt, the machine can then be used to predict the output on unseen inputs. The collection of data-point and label pairs is the labeled dataset. The machine makes use of this dataset to learn a function. A labeled dataset is nothing but the textbook problems in our analogy.

This description is somewhat incomplete and we will fix the gaps in the upcoming sections.

Data Representation

We are given a collection of \(n\) data-points and \(n\) labels. Each data-point is described by \(d\) features. For example, in the housing dataset, each house is a data-point and is described by features such as latitude, longitude, area and so on. These features are clubbed together in a feature-vector of size \(d\). Arranging the \(n\) data-points in a matrix, we get a \(n\times d\) data-matrix. Let us call this matrix \(\mathbf{X}\):

\[ \begin{equation*} \mathbf{X} =\begin{bmatrix} x_{11} & - & x_{1d}\\ | & x_{ij} & | \\ x_{n1} & - & x_{nd} \end{bmatrix} \end{equation*} \]

Each row of this matrix is the feature vector for one data-point. The element \(x_{ij}\) is the \(j^{th}\) feature of the \(i^{th}\) data-point. The labels can be put together in a vector of size \(n\). Let us call this \(\mathbf{y}\):

\[ \begin{equation*} \mathbf{y} =\begin{bmatrix} y_{1}\\ \vdots \\ y_{n} \end{bmatrix} \end{equation*} \]

The pair \((\mathbf{X}, \mathbf{y})\) is called the labeled dataset. It is also common to represent the labeled dataset as a set of pairs. Denoting the set as \(\mathcal{D}\), we have:

\[ \mathcal{D} = \{(\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_n, y_n)\} \]

Model

We have been using the generic term “machine” so far. A more technical term in this context is a model. A regression model can be thought of as a machine that attempts to encode the relationship between the input features of a house and its selling price. To be mathematically precise, a regression model can be viewed as a function that transforms a data-point into a label. Formally:

\[ \begin{equation*} f:\mathbb{R}^{d}\rightarrow \mathbb{R} \end{equation*} \]

Each feature-vector is of size \(d\). So, the feature-vectors reside in the \(d\) dimensional space \(\mathbb{R}^{d}\). The labels are real numbers, so they reside in \(\mathbb{R}\). Mathematically, this is the action of a model on a data-point \(\mathbf{x}\):

\[ \begin{equation*} \mathbf{x} \rightarrow f(\mathbf{x} ) \end{equation*} \]

Pictorially:

flowchart LR
  A[Feature Vector] --> B[Model]
  B[Model] --> C[Label]

What is so special about an ML model? All models take some input and produce a corresponding output. The key difference is that, unlike in a classical programming setting where we are given the input and the function and are asked to find the output, in machine learning, we are given both the input and the output using which have to learn a model \(f\). The function or the model is the unknown. This is what has to be learnt. This is one reason machine learning is often associated with the phrase “learning from data”.

Predicted label and true label

We represent the model as a function from feature-space to the label space: \(\displaystyle f:\mathbb{R}^{d}\rightarrow \mathbb{R}\). \(\displaystyle f(\mathbf{x})\) belongs to the label space, \(\displaystyle \mathbb{R}\) in this case, but it would be inaccurate to call it the label corresponding to the feature-vector \(\displaystyle \mathbf{x}\). It is the predicted label, that is, we are using \(\displaystyle f\) to predict the selling price for a house whose features are stored in the vector \(\displaystyle \mathbf{x}\). To distinguish this predicted selling price from the actual selling price \(\displaystyle y\), we call \(\displaystyle y\) the true label. To keep the notation clean, we refer to the predicted label as \(\displaystyle \hat{y}\):

\[ \begin{equation*} \hat{y} =f(\mathbf{x}) \end{equation*} \]

This distinction is important. So let us summarize all this:

Labels

\(\displaystyle (\mathbf{x} ,y)\) is one instance in the dataset
- \(\displaystyle \mathbf{x}\) is the feature-vector or the data-point and
- \(\displaystyle y\) is the true label.
\(\displaystyle \hat{y}\) is the predicted label and given by \(\displaystyle \hat{y} =f(\mathbf{x})\)

Performance metric

Let us restate the problem we are trying to solve:

Problem

Given the features of a house can we train a machine to predict its selling price?

We have reduced this to the problem of searching for a function \(\displaystyle f\). How do we know if we have achieved our goal? What is a “good” function? How do we define “good”?

Going back to our exam analogy, \(\displaystyle y\) refers to the actual answer to the question \(\displaystyle \mathbf{x}\) and \(\displaystyle \hat{y} =f(\mathbf{x})\) is the answer given by the machine. One measure of the goodness of \(\displaystyle f\) is the closeness of these two answers. Since we are only interested in the difference, we have:

\[ \begin{equation*} |\hat{y} -y| \end{equation*} \]

We call \(\displaystyle |\hat{y} -y|\) the error in prediction. The smaller the error, better the prediction. This is for one house. For \(\displaystyle n\) houses, we can accumulate the errors by summing them up:

\[ \begin{equation*} |\hat{y}_{1} -y_{1} |+\cdots +|\hat{y}_{n} -y_{n} | \end{equation*} \]

For reasons that will become clear soon, we prefer to use the square of the differences instead:

\[ \begin{equation*} (\hat{y}_{1} -y_{1})^{2} +\cdots +(\hat{y}_{n} -y_{n})^{2} \end{equation*} \]

This quantity is called the Sum of Squared Errors (SSE). SSE is an example of an error function or a loss function. The loss function is a meaure of how well the model fits a given dataset. With this, we have a clear task in front of us: find a function \(\displaystyle f\) that will give us the smallest SSE.

However, recall that we have two datasets: train and test. Getting a good score on the train dataset, though a good start, is not the ultimate goal. The real deal is to do well on the exam, the test dataset. Thus, we can put out a tentative statement of our goal as follows:

Tentative Goal

Learn a function \(\displaystyle f:\mathbb{R}^{d}\rightarrow \mathbb{R}\) using the train-dataset that will achieve the smallest error on the test-dataset.

We now have an optimization problem on our hands.

Learning

ML is all about learning from data. But who or what is learning? More importantly, who or what enables the learning? There is a learning algorithm which drives the learning. We can think of the model as the outcome of the learning process. During the learning stage, the train-dataset is fed as input to a learning algorithm, which in turn outputs a model.

flowchart LR
  A[Labeled Dataset] --> B[Learning Algorithm]
  B[Learning Algorithm] --> C[Model]

There is one important detail that is missing in this diagram. There are several models that we could choose from. Going back to our analogy, there are different ways to understand three digit addition:

representing numbers as counts of physical objects
representing numbers as abstract entities that can be manipulated

Teachers might choose the first model to help kids understand addition. We all would have come across this example at some point: “If I have two chocolates in my left hand and five in my right hand, how many do I have in total?” As kids grow up, teachers might move to the second model, which is more sophisticated.

Something similar happens in ML. We have to guide the machine in choosing the right model. However, the space of all possible models is too vast and this turns out to be a really hard problem to solve computationally. To keep things more tractable, we restrict the search space to a certain family of models and present this restricted class of models to the machine so that it can pick the best one from this family.

flowchart LR
  A[Labeled Dataset] --> B[Learning Algorithm]
  D[Model Family] --> B[Learning Algorithm]
  B[Learning Algorithm] --> C[Model]

Thus there are two inputs to the learning algorithm:

labeled train-dataset, \(\mathcal{D}\)
family of models, \(\mathcal{H}\)

\(\mathcal{D}\) and \(\mathcal{H}\) are symbols that are often used for the labeled dataset and the family of models. We can now refine the goal as follows:

Ideal Goal

Choose a function \(\displaystyle f:\mathbb{R}^{d}\rightarrow \mathbb{R}\) from the the model family \(\mathcal{H}\) using the train-dataset \(\mathcal{D}\) that will achieve the smallest error on the test-dataset.

ML scientists have come up with a variety of models and model families. The simplest such class is the family of linear models. We shall take this up in subsequent chapters.

Remark

For some regression models, once you have learnt the model, you can throw away the dataset (textbook). This is not true of all regression models though! Think about how you learnt three-digit addition. Do you still carry your primary school textbooks around? No! Your mind has a representation of what addition is. This representation is what we call a model.

Test is hidden

The goal defined in the previous section is an ideal scenario. There is one serious difficulty in the way of practical implementation. We don’t have access to the test dataset! The test dataset is hidden from our view during training. But this needn’t be cause for panic. Going back to our exam analogy, students are faced with a similar situation. They don’t know the exact problems that they will be given during the exam. One way to ensure that they do well in the final exam, irrespective of the actual problems thrown at them, is to make the best use of the resources they have and that is the textbook in their hands.

Getting back to machine learning, something that we do have in hand is the train-dataset. So we try to find a model from the family that gives the smallest error on the train-dataset. Alas, this method is not a panacea either! What if the model so chosen memorizes the train-dataset, like our naïve primary school kids? This is a valid concern. Trying to address this will take us too deep into the territory of machine learning. Let us then pause at this stage and try to restrict out attention to the problem of choosing the best model that minimizes the error on the train-dataset.

Practical Goal

It looks like we have circled back to our tentative goal stated at the beginning. We had to do this so as to restrict the scope of our discussion to the mathematical foundations of machine learning.

Warning

Let us never forget that this is only a temporary expedient and when we take up a full course on machine learning, this crucial point will receive its due consideration.

Summary

Regression is a typical example of an ML problem that uses labeled data to learn a mapping or a function from a feature-vector to a real number. The data-points are arranged in a data-matrix called \(\mathbf{X}\) and the label vector \(\mathbf{y}\). \((\mathbf{X}, \mathbf{y})\) is called the labeled dataset. A model is a function \(f:\mathbb{R}^{d} \rightarrow \mathbb{R}\) that transforms a feature-vector to a label. \(\hat{y} = f(\mathbf{x})\) is the predicted label, while \(y\) is called the true label. SSE is a metric that measures the goodness of the function \(f\). The goal in a regression problem is to choose a function \(f\) from a model family using the train-dataset that returns the smallest SSE on the test-dataset.