 

Why do too many features cause overfitting?

I have been searching all night and found no resolution.

Too many features mean too many parameters, but what is the relationship between the number of parameters and the wiggly curve?

Guan (asked Jun 12 '16)


4 Answers

In machine learning, you split your data into a training set and a test set. The training set is used to fit the model (adjust the model's parameters); the test set is used to evaluate how well your model will do on unseen data.

Overfitting means your model does much better on the training set than on the test set. It fits the training data too well and generalizes badly.

Overfitting can have many causes and is usually a combination of the following:

  • Too powerful a model: e.g. you allow polynomials up to degree 100. With polynomials up to degree 5 you would have a much less powerful model that is much less prone to overfitting.
  • Not enough data: getting more data can sometimes fix overfitting problems.
  • Too many features: your model can identify single data points by single features and build a special case just for a single data point. For example, think of a classification problem and a decision tree. If you have feature vectors (x1, x2, ..., xn) with binary features and n points, and each feature vector has exactly one 1, then the tree can simply use this as an identifier (see the sketch after this list).
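To make the last point concrete, here is a minimal sketch (assuming scikit-learn and NumPy; the data construction is a hypothetical illustration, not from the answer above). Each training point gets its own one-hot "identifier" feature, so the tree can memorize random training labels perfectly while doing no better than chance on unseen points:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    n = 50

    # One-hot "identifier" features: row i has a 1 only in column i.
    X_train = np.eye(n)
    y_train = rng.integers(0, 2, size=n)  # random labels, no real signal

    tree = DecisionTreeClassifier().fit(X_train, y_train)
    print("train accuracy:", tree.score(X_train, y_train))  # 1.0 -- memorized

    # Unseen points activate none of the identifier columns, so the tree
    # routes them all to one leaf and does no better than guessing on random labels.
    X_test = np.zeros((n, n))
    y_test = rng.integers(0, 2, size=n)
    print("test accuracy:", tree.score(X_test, y_test))  # around 0.5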
Martin Thoma (answered Oct 22 '22)


Having a lot of features is pretty much like having a lot of dimensions. Effectively it means your data is more sparse, so it's a lot more likely you end up drawing a conclusion that isn't warranted.

Imagine you have to decide how long a ruler needs to be, because you're selling them in a shop. If the only dimension is length, you might be able to get away with making 5 or 6 different rulers and seeing what sells.

Now imagine you are deciding what size of box to sell. Now you've got 3 dimensions. If 5 different sizes were enough to test in one dimension, maybe you now need 5^3 = 125 different sizes. If your data only has, say, 20 different boxes, you might come to the wrong conclusion about what size people want.
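A quick sketch of the arithmetic behind this (a hypothetical illustration, not from the answer itself): with 5 candidate values per dimension, the number of combinations to cover grows as 5^d, so a fixed budget of 20 data points covers a rapidly shrinking fraction of the space:

    # How much of the 5**d grid of possible "sizes" 20 data points can cover
    # at best, as the number of dimensions d grows.
    samples = 20
    for d in (1, 2, 3, 4):
        cells = 5 ** d
        print(f"d={d}: {cells:4d} combinations, coverage <= {min(samples, cells) / cells:.0%}")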

Luckily, you may be able to reduce the dimensionality. For instance, if the box can be turned on its side (say it's a moving box, and you just need the bottom not to fall out), you might find there are really only 2 dimensions people care about.

like image 20
Carlos Avatar answered Oct 21 '22 23:10

Carlos


I had exactly the same question, and since I wasn't able to really understand the reason from the existing answers, I did some additional searching and thought about it for a while. Here is what I found. Feel free to correct me where I'm wrong.

The main reason for overfitting is sparse data (for a given model).

The three reasons mentioned in the answer above can be narrowed down to "sparse data" for your given problem. This is an important concept to understand, since the sparsity of the data depends on the number of features.

It is always easier to understand any concept if you think of it in its simplest form and find a way to visualize it.

So let's first see how sparse data can cause overfitting, using a two-dimensional plot for a problem with one feature. If your hypothesis is a high-order polynomial and the number of data points is small, it will simply overfit the data. If you had more data points, it wouldn't overfit, since it would have to minimize the average error over many data points (more than it could overfit to), and that would force it to pass through "the middle".

[image]
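A minimal sketch of this effect (a hypothetical illustration using NumPy, not part of the original answer): the same high-degree polynomial overfits badly when fitted to a handful of noisy points, but behaves much better when many points force it through "the middle":

    import numpy as np

    rng = np.random.default_rng(0)

    def test_error(n_train, degree=9, n_test=1000, noise=0.3):
        """Fit a degree-`degree` polynomial to n_train noisy samples of sin(x)
        and return its mean squared error on a dense held-out grid."""
        x_train = rng.uniform(0, 2 * np.pi, n_train)
        y_train = np.sin(x_train) + noise * rng.standard_normal(n_train)
        coeffs = np.polyfit(x_train, y_train, degree)

        x_test = np.linspace(0, 2 * np.pi, n_test)
        return np.mean((np.polyval(coeffs, x_test) - np.sin(x_test)) ** 2)

    print("10 training points:  test MSE =", test_error(10))   # large: wiggly overfit
    print("200 training points: test MSE =", test_error(200))  # small: passes through "the middle"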

Now, suppose we have another problem for which, apparently, the data is not sparse.

[image]

Later, though, we learn that there is an additional feature in this same problem. This means that each data point of the existing training set also has a second value describing it: the value of the second feature. If we try to plot it now, we might see something like this:

[image]

This means that what we were seeing before was just the projection of the data points onto the (y, θ1) plane, which is why we mistakenly assumed that we had enough data points. Now that we see the problem in its entirety, we can say that the data points are not enough: the data is sparse.

Thus, by adding one more feature we expanded the space of our problem by one dimension, and the data points that live in this space were spread out accordingly.

So if we try to fit a hypothesis to this data, we might get something like the following, which probably overfits.

[image]

If we had more data points, though, we could end up with something like this:

[image]

In conclusion, adding more features expands the hypothesis space, makes the data sparser, and this can lead to overfitting problems.
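To illustrate the conclusion numerically, here is a small sketch (a hypothetical illustration, not from the answer itself) in which only the first feature carries signal and every additional feature is pure noise; with a fixed, small training set, a plain least-squares classifier typically fits the training data better and better while its test accuracy degrades as features are added:

    import numpy as np

    rng = np.random.default_rng(0)

    def accuracies(n_noise_features, n_train=30, n_test=1000):
        """Train/test accuracy of a least-squares linear classifier when only
        the first feature is informative and the rest are random noise."""
        def make_data(n):
            signal = rng.standard_normal((n, 1))
            noise = rng.standard_normal((n, n_noise_features))
            X = np.hstack([signal, noise])
            y = np.sign(signal[:, 0] + 0.5 * rng.standard_normal(n))  # noisy labels
            return X, y

        X_tr, y_tr = make_data(n_train)
        X_te, y_te = make_data(n_test)
        w, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
        acc = lambda X, y: np.mean(np.sign(X @ w) == y)
        return acc(X_tr, y_tr), acc(X_te, y_te)

    for d in (0, 5, 15, 25):
        train_acc, test_acc = accuracies(d)
        print(f"{d:2d} noise features: train {train_acc:.2f}, test {test_acc:.2f}")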

dimyG (answered Oct 21 '22)


This is just an example. In general, in order to fit a very complex (noisy) dataset perfectly, you need a very "wiggly" curve (since your functions are usually smooth). It is not true that it will always look like this; given a specific class of approximators you can get different phenomena, such as "spiky" functions. The point is: the more tunable parameters, the more complex the function, and since you have a limited training set on which the actual value of the function is specified, the function can take any shape outside the training set. That's why it was drawn as "wiggly".

However, it does not work the other way around. If you have very few parameters you can still overfit and get a "wiggly" function; consider for example

f(x) = cos(<w, x>)

With a big enough norm of w, you can make the cosine arbitrarily "dense" and thus fit nearly any -1/+1 labeling of the data.
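A small one-dimensional sketch of this idea (a hypothetical illustration, not from the answer itself): with a single parameter w, a brute-force scan over increasingly large w usually finds a value for which sign(cos(w * x)) matches an arbitrary random ±1 labeling of a handful of points; small w cannot do this, because cos(w * x) barely oscillates:

    import numpy as np

    rng = np.random.default_rng(0)

    # A few 1-D data points with completely random +1/-1 labels.
    x = rng.uniform(0.5, 1.5, size=8)
    y = rng.choice([-1.0, 1.0], size=8)

    # Single-parameter model f(x) = sign(cos(w * x)); scan a dense grid of w.
    ws = np.linspace(0.0, 2000.0, 200_001)
    patterns = np.sign(np.cos(np.outer(ws, x)))  # one sign pattern per candidate w
    hits = np.where(np.all(patterns == y, axis=1))[0]

    if hits.size:
        print(f"w = {ws[hits[0]]:.2f} reproduces the random labeling exactly")
    else:
        print("no w in the scanned range fits this labeling")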

lejlot (answered Oct 21 '22)