How to preprocess data for machine learning? [closed]

I just wanted some general tips on how data should be preprocessed before feeding it into a machine learning algorithm. I'm trying to further my understanding of why we make different decisions at preprocessing time. If someone could go through the different things we need to consider when cleaning up data, removing superfluous data, etc., I would find it very informative, as I have searched the net a lot for canonical answers or rules of thumb and there don't seem to be any.

I have a set of data in a .tsv file available here. The training set amounts to 7,000 rows, the test set 3,000. What strategies should I use for handling badly-formed data if 100 rows are unreadable in each? 500? 1,000? Any guidelines to help me reason about this would be very much appreciated.

Sample code would be great to see, but is not necessary if you don't feel like it; I just want to understand what I should be doing! :)

Thanks

asked Jan 26 '14 by Simon Kiely


1 Answer

There are a lot of things which need to be decided based on the actual data. It is not as simple as naming a few steps you always need to take when you get data.

However, I can try to name a few things which usually help a lot. Still, the first and most important thing is to thoroughly analyze the data and do your best to understand it. Understanding the data and all the background behind how it was crawled and collected is an essential part. If you understand how it comes about that there are missing data or noise, then you have a clue how to handle it.

I will try to give you a few hints, though:

  1. Normalize values - It is not always necessary to normalize all the features, but generally normalization can't hurt and it can help a lot. Thus, if you are not limited, give it a try and use normalization for all the features except those for which it is clearly nonsensical. The most common normalization methods are linear normalization (mapping the feature values into the [0, 1] range) and z-normalization, which means that you subtract the mean of the feature values and divide the result by the standard deviation. It is not possible to say in general which one is better (we are getting back to understanding the data). A minimal sketch of both methods is shown after this list.
  2. Missing values - It is necessary to decide what to do with missing values, and there are a few ways to handle them. One is to remove the samples with missing values: if you have enough data samples, it may not be worth keeping such samples, as they may only bring noise into your results. In the case where only one feature value is missing in a sample, you can instead fill in the value with the mean of that feature (but be careful, because again this can bring noise into the results). Both options are sketched after the list.
  3. Outliers - In many cases you will come across samples which are far away from the other samples, i.e. outliers. Outliers are usually just noise or mistakes in the data, but they can also be a signal of special behavior (e.g. when something violates the usual behavior pattern, it can be a sign of actions caused by an attacker - e.g. in bank networks). In most cases it is a good idea to just remove the outliers, as their number is usually really low while they can have a big influence on your results. Considering a histogram as an example, I would just cut off, let's say, the 0-2.5 and 97.5-100 percentiles, as sketched after the list.
  4. Mistakes - It is very likely there will be mistakes in the data. This is the part where I can't give you any hints, as it is necessary to really understand all the background and to know how it could have happened that there are mistakes.
  5. Nominal values - If there are any nominal values which can be ordered, then just replace the nominal values with numbers (0, 1, 2, 3, 4, 5). If the values cannot be ordered (e.g. color = blue, black, green...), then the best way is to split the feature into as many features as the cardinality of the set of possible values, and transform the feature into binary values - "Is green?" Yes/No (0/1). This is sketched after the list as well.
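
To make point 1 concrete, here is a minimal sketch of both normalization methods using pandas. The column names (`age`, `income`) are made up for illustration; they are not from the question's .tsv file:

```python
import pandas as pd

# Made-up frame standing in for the real .tsv columns.
df = pd.DataFrame({"age": [18.0, 25.0, 40.0, 61.0],
                   "income": [1200.0, 3100.0, 2500.0, 5400.0]})

def min_max_normalize(col: pd.Series) -> pd.Series:
    """Linear normalization: map values into the [0, 1] range."""
    return (col - col.min()) / (col.max() - col.min())

def z_normalize(col: pd.Series) -> pd.Series:
    """Z-normalization: subtract the mean, divide by the standard deviation."""
    return (col - col.mean()) / col.std()

linear = df.apply(min_max_normalize)  # every feature squeezed into [0, 1]
zscored = df.apply(z_normalize)       # every feature with mean 0, std 1
print(linear, zscored, sep="\n\n")
```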
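For point 2, a sketch of the two ways to handle missing values described above, again on a made-up frame:

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps (NaN marks a missing value).
df = pd.DataFrame({"age": [18.0, np.nan, 40.0, 61.0],
                   "income": [1200.0, 3100.0, np.nan, 5400.0]})

# Option A: drop every row that has any missing value
# (cheap and safe when you have plenty of samples).
dropped = df.dropna()

# Option B: fill each gap with the mean of its feature
# (keeps the sample, but can add noise, as noted above).
filled = df.fillna(df.mean())
print(dropped, filled, sep="\n\n")
```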
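For point 3, a sketch of the percentile cut-off on a synthetic column with one planted outlier. Keep in mind that clipping at the 2.5th/97.5th percentiles always drops about 5% of the rows, including legitimate tail values, so tune the band to your data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = rng.normal(3000.0, 500.0, size=1000)
income[0] = 250_000.0  # plant one obvious outlier
df = pd.DataFrame({"income": income})

# Keep only the rows inside the 2.5th-97.5th percentile band.
low, high = df["income"].quantile([0.025, 0.975])
trimmed = df[df["income"].between(low, high)]
print(len(df), "->", len(trimmed))  # the planted outlier is dropped
```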
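And for point 5, a sketch of both encodings: a hand-written mapping for an orderable feature and `pd.get_dummies` for the "Is green?" style binary split. The features `size` and `color` are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"size": ["small", "large", "medium", "small"],  # orderable
                   "color": ["blue", "black", "green", "blue"]})   # not orderable

# Orderable nominal values: replace with numbers that respect the order.
df["size"] = df["size"].map({"small": 0, "medium": 1, "large": 2})

# Non-orderable values: one binary "Is X?" feature per possible value.
df = pd.get_dummies(df, columns=["color"])
print(df)
```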

In summary, it is really hard to answer in general. A good way to avoid "making things worse" is to start by removing all the "bad values": just remove all the rows with missing or wrong values, transform all the other values as mentioned above, and try to get your first results. Then you will have a better understanding of the data and a better idea of where to look for improvements.

If you have any further questions regarding particular "pre-processing problems", I will be happy to edit this answer and add more ideas on how to handle them.

answered Sep 17 '22 by Marek