Machine learning: How much data do learning methods require?

Dr. Jonas Steeger

July 17, 2018

Machine learning is on everyone's lips and is often mentioned in the same breath as big data. But how many training examples are needed to make machine learning work?

Data, data, data... is that enough?

The age of digital data has long since begun. Especially in companies, there are hardly any activities that do not involve a lot of data. But the sheer mass of data does not always mean that machine learning can be used. The data must also have a certain quality.

Data quality > Data mass

As is so often the case in life, the quality of the available data has to be right as well as the quantity. But the quality depends on what you want to do with the data.

A small example

A car dealership submits 10,000 offers for window repairs a year. Every offer is a little different: the type of windscreen, the type of damage, the repair material used, the duration of the order, the mechanic, the price and the time of the repair all differ.

The Business Intelligence Unit wants to find out whether it can predict, at the moment an offer is submitted, how likely that offer is to be accepted.

In our example, the figure "10,000" sounds like quite a lot. But what do you do if 9,990 offers are accepted? The comparison group of rejected offers then contains only 10 cases and is unlikely to be large enough to base predictions on.
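To make the imbalance in this example concrete, here is a minimal Python sketch. The function name `class_balance` and its return keys are made up for illustration; only the figures 10,000 and 9,990 come from the example. It also shows why plain accuracy is misleading here: a "model" that always answers "accepted" is already right almost every time without having learned anything.

```python
def class_balance(total: int, accepted: int) -> dict:
    """Summarize how the offers split into accepted and rejected ones."""
    rejected = total - accepted
    return {
        "rejected": rejected,
        "rejected_share": rejected / total,
        # Accuracy of always predicting the larger group, without any learning.
        "baseline_accuracy": max(accepted, rejected) / total,
    }


balance = class_balance(total=10_000, accepted=9_990)
print(balance["rejected"])                     # 10
print(f"{balance['rejected_share']:.2%}")      # 0.10%
print(f"{balance['baseline_accuracy']:.2%}")   # 99.90%
```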

Unfortunately, there is no simple answer to the question asked at the beginning. The amount of data required depends not only on the number of properties - the dimensionality - of the data, but also on the structure and distribution of the data.

Depending on the learning method used, you will need good or very good data. With very bad data, it never really works.

You can calculate it at least theoretically

At least the good old Computational Learning Theory offers a little help. It assumes ideal learning methods, which makes it possible to quantify the minimum amount of training data required. With just a few (though here and there complicated) steps, you can determine how many training cases an optimal learning procedure needs. The problem: you simply will not find these optimal conditions in practice.
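One classic result of this kind, chosen here purely as an illustration and not necessarily the calculation the text has in mind, is the sample bound for a finite set of candidate models: if a learner picks a model that fits all training examples, then m >= (ln|H| + ln(1/delta)) / epsilon examples are enough to keep the error below epsilon with probability at least 1 - delta. The sketch below evaluates that bound; the function name and the example values are assumptions.

```python
import math


def pac_sample_bound(hypothesis_count: int, epsilon: float, delta: float) -> int:
    """Sufficient number of examples for a finite hypothesis class.

    Assumes the idealized setting of the bound: one of the candidate
    models is perfectly correct and the learner returns a model that is
    consistent with all training examples.
    """
    if hypothesis_count < 1 or not 0 < epsilon < 1 or not 0 < delta < 1:
        raise ValueError("need hypothesis_count >= 1 and 0 < epsilon, delta < 1")
    return math.ceil((math.log(hypothesis_count) + math.log(1 / delta)) / epsilon)


# Hypothetical setting: 1,024 candidate models, at most 10% error,
# and a 95% chance that this guarantee holds.
print(pac_sample_bound(hypothesis_count=1024, epsilon=0.1, delta=0.05))  # 100
```

The assumptions in the docstring are exactly the "optimal conditions" that are hard to find in practice, so the number is best read as orientation rather than as a guarantee.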

There has to be an answer, right?

I'm afraid not. But there is a rule of thumb: there is no point in starting with fewer than 50 data points. Often, though, 50 observations are enough to develop a feeling for the data structure. That alone is worth a lot, because then you can think about which data you need and what you have to do to get it. As a rule, however, you need far more than 50 observations. In our experience, anything over 1,000 is a step in the right direction. But we have also seen problems where even 1,000,000 data points were just not enough.

There are some lessons from practice that will help you assess whether you need many or few observations:

The more intuitive your hypothesis, the less data you need.
The rarer the event, the more data you need.
The more properties your data has, the more data you need.
The more model parameters your learning model has, the more data you need.
The more non-linear the relationships, the more data you need.
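The point about rare events can be turned into rough arithmetic. The sketch below assumes that the share of the rare event stays the same as in the data seen so far, and it uses a hypothetical target of 500 rare examples; the function name is likewise made up for illustration.

```python
def observations_needed(target_rare_examples: int, rare_count: int, total_count: int) -> int:
    """Total observations needed to expect a given number of rare cases.

    Assumes the rare share seen so far (rare_count / total_count) also
    holds for future data. Integer ceiling division avoids float rounding.
    """
    if rare_count <= 0 or total_count < rare_count:
        raise ValueError("need 0 < rare_count <= total_count")
    return -(-target_rare_examples * total_count // rare_count)


# Car dealership example: 10 rejected offers out of 10,000.
# Collecting 500 rejected offers would take about 500,000 offers in total.
print(observations_needed(target_rare_examples=500, rare_count=10, total_count=10_000))  # 500000
```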

Number tricks help to determine whether you have many or few data points

If the little help above is not enough for you, you can also use a few tricks:

Factor of the number of groups examined: For each group there must be X independent examples, where X should be in the hundreds or thousands (e.g. 500, 5,000, etc.). So if you want to compare two groups, you should have at least 500 x 2 = 1,000 data points.
Factor of the number of properties: The amount of data from step one is scaled by X% per data property, where X should be in the tens (e.g. 50). So if you examine an object that has three properties (e.g. size, color and price), you should have at least 3 x 50% = 150%, i.e. 1.5 times as much data as determined in step one: 1.5 x 1,000 = 1,500.
Factor of the number of model parameters: For each parameter in the model, you need X% more data, where X should be in the tens (e.g. 10, 20, 30, etc.). If you now build your model and it again has three parameters, then you should have e.g. 3 x 10% = 30% more data points than determined in step two: 1,500 x 1.3 = 1,950.
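The three tricks can be chained in a few lines of Python. This is only a sketch of the worked example: the function and parameter names are made up, and the default of 50% per property is an assumption chosen so that three properties give the factor of 1.5 used in the text. The floor that keeps step two from falling below step one is also an addition of this sketch, for the case of only one property.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up, to avoid float rounding."""
    return -(-a // b)


def minimum_data_points(groups: int, properties: int, parameters: int,
                        per_group: int = 500,
                        pct_per_property: int = 50,
                        pct_per_parameter: int = 10) -> tuple:
    """Return the data points suggested after each of the three tricks."""
    # Trick 1: X independent examples per group.
    step1 = groups * per_group
    # Trick 2: scale by X% per property (3 x 50% = 150% = factor 1.5),
    # but never go below the result of trick 1.
    step2 = max(step1, ceil_div(step1 * properties * pct_per_property, 100))
    # Trick 3: X% more data per model parameter (3 x 10% = 30% more).
    step3 = ceil_div(step2 * (100 + parameters * pct_per_parameter), 100)
    return step1, step2, step3


# Two groups, three properties, three model parameters, as in the text.
print(minimum_data_points(groups=2, properties=3, parameters=3))  # (1000, 1500, 1950)
```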

If the amount of data you have is at or above the numbers these three little tricks produce, then your model has a good chance of working.

What to do if there is no answer?

The simple answer: just start. In most cases you cannot do more in advance. But you should not concentrate on one model, one problem and one data set! Diversification is the magic word here. It is therefore not uncommon to run several attempts at the same time. That, however, requires good project management - and of course Falcon can help you with that! Interested? Write to us via info@nordantech.com.