Datasets

Question

What is a dataset and why is it important?

Datasets

There are different kinds of datasets. The housing dataset that we saw right at the beginning is a tabular dataset: the data comes in the form of a table. Each column of this table is called an attribute or a feature and each row represents one record or observation. Recall that we also use the term data-point to refer to each row of the table. Tabular datasets are among the most commonly seen datasets in machine learning. Tabular data can be neatly packed into comma-separated values (CSV) files. Google Sheets and Microsoft Excel are popular tools to manipulate tabular datasets.
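
To make the vocabulary of rows and columns concrete, here is a minimal Python sketch that reads a tiny CSV using the standard-library csv module. The column names and values are made up for illustration and are not taken from the housing dataset.

```python
import csv
import io

# Hypothetical housing records; column names and values are invented.
CSV_TEXT = """area_sqft,bedrooms,price
1200,2,150000
1800,3,240000
950,1,110000
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one dict per row (data-point)."""
    return list(csv.DictReader(io.StringIO(text)))


rows = load_rows(CSV_TEXT)
features = list(rows[0].keys())  # the header row gives the feature names

print("features:", features)
print("number of data-points:", len(rows))
```

Each dictionary in rows is one data-point and each key is one feature. Note that csv.DictReader returns every value as a string, so numeric columns would still need to be converted before any modelling.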

Apart from tabular datasets, some other popular types of datasets are:

image
text
speech

Image, text and speech data cannot be packed into simple CSVs and are often called unstructured data.

Whence comes data?

How do we obtain data? Where does data come from? This seems like a simple question but it doesn’t have a simple answer. Here are some scenarios that are arranged in increasing order of complexity:

Scenario-1

An FMCG company has given you some historical data concerning its sales over the last three years. It wants you to predict the average sales in the coming quarter.

Here we are lucky. Someone comes to our doorstep and gives us the data. It might be the case that the company has neatly arranged the data in a tabular format. In addition, we also have a very precise definition of the problem statement. We have to predict a real number by looking at the data. It is a regression problem.

Scenario-2

Twitter is developing an algorithm to detect tweets that contain offensive content. As a data scientist, you are given a dump of one million tweets and asked to develop an algorithm to solve the problem.

This is a more challenging problem compared to scenario-1. First, this is an instance of what is called a binary classification problem. Instead of predicting a real number, we have to predict one of two (binary) outcomes for each tweet:

offensive
not-offensive

In order to train a computer to distinguish between the two kinds of tweets, we need to give it examples of tweets of both kinds. Unfortunately, the dump does not tell us which tweets are offensive and which are not. Without that information, how can we teach the computer to differentiate between the two? So, the first task here is to get the dataset labeled. That is, for each tweet, we need to mark it as “offensive” or “not-offensive”. This process is time-consuming and requires considerable manpower, especially if the dataset is large.
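
The result of the labeling step can be pictured with a small Python sketch. The tweets, the labels and the helper attach_labels are all hypothetical; in practice the labels would come from human annotators, not from code.

```python
from collections import Counter

# Hypothetical raw tweets (unlabeled data).
tweets = [
    "have a great day",
    "you are an idiot",
    "lovely weather today",
    "nobody likes you, loser",
]

# Labels supplied by (hypothetical) human annotators, one per tweet.
annotations = ["not-offensive", "offensive", "not-offensive", "offensive"]


def attach_labels(examples, labels):
    """Pair every example with its label to form a labeled dataset."""
    if len(examples) != len(labels):
        raise ValueError("every example needs exactly one label")
    return list(zip(examples, labels))


labeled = attach_labels(tweets, annotations)
counts = Counter(label for _, label in labeled)

print(labeled[1])
print(dict(counts))
```

Counting the labels, as done with Counter here, is a quick way to check that both kinds of tweets are actually represented in the labeled dataset.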

Scenario-3

You are a research scientist at a manufacturing company. You want to set up a facility that automates the segregation of defective products from non-defective ones. Come up with an end-to-end ML solution.

This is by far the most challenging scenario. We don’t have access to the data. We need to gather data in the first place. Once we have the data, we need to label it or annotate it. Only then can we start thinking about training machines using the data.

Supervision

Labeling a dataset is an important part of the data preparation process. However, there may be situations where labeling is not practically feasible. In such cases, we have to settle for unlabeled data. Therefore, datasets in ML can be classified into two categories:

labeled dataset
unlabeled dataset

Techniques that work with labeled data fall under the category of supervised learning. Those that work with unlabeled data come under unsupervised learning. What is so special about the term “supervised”?

The Cambridge Dictionary defines the verb supervise as follows: to watch a person or activity to make certain that everything is done correctly, safely. By a slight extension of this definition, we could say that a supervisor is a teacher who tells us whether we are right or wrong. In this sense, the label performs the role of a supervisor for the machine as it is learning. With unlabeled data, there is no supervision available.

Partitioning the dataset

As humans, how do we know if someone has learnt a skill or not? Some kind of examination is the mechanism that we have converged upon. For almost every skill that we can think of, there is some test or exam to evaluate our competency in that skill. Exams are so ubiquitous that we often conflate learning with scoring well in exams. For a machine, however, getting a good score in an exam is a good enough proxy for learning.

Let us now try to understand the context surrounding an exam better with the help of a human analogy.

Three-digit Addition

A primary school teacher has taught three-digit addition to her students. To know if the kids have learnt this task, she decides to conduct a test that has problems on three-digit addition.

An important feature of testing is to make sure that it is at the right level of difficulty for the learners. If the teacher asks the same questions that are there in the math textbook from which she teaches, kids might score high marks. It is quite likely that the kids have simply memorized the answers. This could even be expected if the number of problems in the textbook is small. So the job of the teacher is to ask problems that use different numbers, those that are not present in the textbook. However, she has to be careful not to step outside the confines of the concept she is testing. In the interest of making the test challenging, she should not ask questions on five-digit addition. That would be unfair.
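
The danger of reusing textbook problems can be sketched in a few lines of Python. The Memorizer class, the accuracy helper and the specific problems are hypothetical and exist only to illustrate the point: a student who memorizes answers looks perfect on the textbook and fails on fresh three-digit problems.

```python
class Memorizer:
    """A 'student' that only stores the answers it has already seen."""

    def __init__(self):
        self.answers = {}

    def fit(self, problems):
        # Memorize each (question, answer) pair verbatim.
        for question, answer in problems:
            self.answers[question] = answer

    def predict(self, question):
        # Returns None for any question that was not memorized.
        return self.answers.get(question)


def accuracy(model, problems):
    """Fraction of problems for which the model gives the right answer."""
    correct = sum(1 for q, a in problems if model.predict(q) == a)
    return correct / len(problems)


# Hypothetical three-digit addition problems: ((a, b), a + b).
textbook = [((123, 456), 579), ((250, 310), 560), ((707, 101), 808)]
exam = [((321, 654), 975), ((400, 199), 599)]

student = Memorizer()
student.fit(textbook)

textbook_score = accuracy(student, textbook)
exam_score = accuracy(student, exam)
print(textbook_score, exam_score)
```

The textbook score here is 1.0 and the exam score is 0.0, even though the exam stays within three-digit addition. A high score on already-seen problems therefore says nothing about whether the skill was learnt.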

Train-test split

Returning to the task of teaching machines to learn from data, we test them in a similar spirit. The dataset we use to train a machine, the textbook, is called the training dataset. The dataset on which we test the machine, the exam, is called the test dataset. In a real-world situation, we won’t be given the training and test datasets neatly split into two parts. We would have to begin by partitioning the dataset into two parts. This is called the train-test split:

train-dataset
test-dataset

It is good practice to shuffle the dataset thoroughly before coming up with this partition. A rule of thumb is to use something like 70% of the original dataset for training and the rest for testing.
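
A shuffle followed by a 70/30 partition can be sketched in plain Python as follows. The function name train_test_split, its parameters and the fixed seed are choices made for this illustration; ML libraries offer their own helpers for the same job.

```python
import random


def train_test_split(data, train_fraction=0.7, seed=0):
    """Shuffle a copy of the data, then cut it into train and test parts."""
    items = list(data)                  # copy, so the original is untouched
    random.Random(seed).shuffle(items)  # seeded shuffle for reproducibility
    cut = round(len(items) * train_fraction)
    return items[:cut], items[cut:]


# A toy dataset of ten data-points, identified by the integers 0 to 9.
dataset = list(range(10))
train, test = train_test_split(dataset)

print("train size:", len(train))
print("test size:", len(test))
```

Every data-point ends up in exactly one of the two parts, so nothing from the test-dataset is seen during training. Fixing the seed makes the split repeatable from one run to the next.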

The machine has access to only the train-dataset during the learning phase. Once the learning process is complete, the machine is evaluated on the test-dataset. The test-dataset is sacred in any ML problem. It should be kept hidden and used only at the end. This is analogous to the effort taken by the administration of colleges and universities to seal exam papers and keep them secure until the day of the examination. If the exam paper somehow gets leaked, the exam can no longer be conducted in a fair manner!

Validation

We train the machine on the train-dataset and evaluate its performance on the test-dataset. But often, we don’t stop with a partition having two parts; we go for a three-way split:

train-dataset
validation-dataset
test-dataset

Going back to the analogy, we can think about the validation-dataset as additional problems for practice or a mock exam that helps the trainer, the human, get a sense of how well the machine is likely to perform on the real test. As we can’t access the test-dataset during the learning stage, the validation-dataset acts as a proxy.
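
A three-way split follows the same recipe as the two-way split. The sketch below is a minimal illustration in Python; the 60/20/20 proportions are an assumption made for this example and are not prescribed by the text, and the function name three_way_split is hypothetical.

```python
import random


def three_way_split(data, train_fraction=0.6, val_fraction=0.2, seed=0):
    """Shuffle a copy of the data and cut it into train, validation, test."""
    items = list(data)
    random.Random(seed).shuffle(items)  # seeded shuffle for reproducibility
    n_train = round(len(items) * train_fraction)
    n_val = round(len(items) * val_fraction)
    train_part = items[:n_train]
    val_part = items[n_train:n_train + n_val]
    test_part = items[n_train + n_val:]  # whatever remains is the test part
    return train_part, val_part, test_part


# A toy dataset of twenty data-points, identified by the integers 0 to 19.
data_points = list(range(20))
train_set, val_set, test_set = three_way_split(data_points)

print(len(train_set), len(val_set), len(test_set))
```

With twenty data-points this gives 12 for training, 4 for validation and 4 for testing. The validation part can be consulted as often as needed during development, while the test part stays untouched until the end.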

Summary

Datasets come in different types: tabular, image, text, speech and so on. The source of data varies from situation to situation. Sometimes the data could be given to us in a well-formatted and usable condition. At other times, we would have to expend effort in gathering data and making it suitable for further processing. Datasets could either be labeled or unlabeled. ML algorithms that deal with labeled data are called supervised learning methods, while those that deal with unlabeled data are called unsupervised learning methods. To evaluate the performance of any machine, it is important to partition the data into two parts, train and test: the machine is trained on the training data and evaluated on the test data.