
Preprocess large datafile with categorical and continuous features

First, thanks for reading this, and thanks a lot if you can give me any clue to help solve it.

As I'm new to scikit-learn, don't hesitate to give any advice that could help me improve the process and make it more professional.

My goal is to classify data into two categories as precisely as possible. At the moment I'm still looking for the most suitable algorithm and data preprocessing.

In my data I have 24 values: 13 are nominal, 6 are binarized and the rest are continuous. Here is an example line:

"RENAULT";"CLIO III";"CLIO III (2005-2010)";"Diesel";2010;"HOM";"_AAA";"_BBB";"_CC";0;668.77;3;"Fevrier";"_DDD";0;0;0;1;0;0;0;0;0;0;247.97

I have around 900K lines for training, and I test on about 100K lines.
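For reference, a minimal sketch of how such a file could be loaded, assuming pandas and a hypothetical file name (the real file just has the semicolon-separated rows shown above, with no header):

```python
import pandas as pd

# "data.csv" is a hypothetical file name; the real dataset has ~900K
# semicolon-separated rows like the example above.
df = pd.read_csv(
    "data.csv",
    sep=";",        # fields are separated by semicolons
    header=None,    # the file has no header row
    quotechar='"',  # nominal values are quoted
)

print(df.shape)   # (n_rows, n_columns)
print(df.dtypes)  # nominal columns load as object, numeric ones as int64/float64
```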

As I want to compare several algorithm implementations, I wanted to encode all the nominal values so that the same data can be fed to several classifiers.

I tried several things:

  1. LabelEncoder: this was quite good, but it gives me ordered values that would be misinterpreted by the classifiers.
  2. OneHotEncoder: if I understand correctly, it is almost perfect for my needs because I can choose which columns to binarize. But as I have a lot of nominal values, it always ends in a MemoryError. Moreover, its input must be numerical, so everything has to be LabelEncoded first (see the sketch after this list).
  3. StandardScaler: this is quite useful, but not for the nominal values. I decided to integrate it to scale my continuous values.
  4. FeatureHasher: at first I didn't understand what it does; then I saw it is mainly used for text analysis. I tried it on my problem anyway, by building a new array from the transformed output, but I don't think it was designed to be used that way and the result didn't really make sense.
  5. DictVectorizer: could be useful, but it behaves like OneHotEncoder and puts even more data in memory.
  6. partial_fit: this method is provided by only 5 classifiers. I would like to use at least Perceptron, KNearest and RandomForest, so it doesn't match my needs.
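To make items 1 and 2 concrete, here is roughly what I tried. This is only a minimal sketch; the toy values, column indices and variable names are made up, not my real pipeline:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy stand-in for the real data: two nominal columns and one numeric one.
X = np.array([
    ["RENAULT", "Diesel", 668.77],
    ["PEUGEOT", "Essence", 247.97],
    ["RENAULT", "Essence", 100.00],
], dtype=object)
nominal_cols = [0, 1]  # indices of the nominal columns

# Step 1: LabelEncoder gives each category an integer, but those integers
# carry an artificial order that a classifier may misinterpret.
X_int = X.copy()
for col in nominal_cols:
    X_int[:, col] = LabelEncoder().fit_transform(X[:, col].astype(str))

# Step 2: OneHotEncoder on the integer-coded nominal columns.
# On the real data (900K rows, 13 nominal columns with many categories)
# this expansion is where the MemoryError shows up.
ohe = OneHotEncoder(handle_unknown="ignore")
X_onehot = ohe.fit_transform(X_int[:, nominal_cols].astype(int))
print(X_onehot.shape)
```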

I looked at the documentation and found this information on the Preprocessing and Feature Extraction pages.

I would like a way to encode all the nominal values so that they are not treated as ordered, and that can be applied to large datasets with many categories on limited resources.

Is there any way I didn't explore that can fit my needs?

Thanks for any clue and piece of advice.

RPresle asked Nov 09 '22


1 Answer

To convert unordered categorical features you can try get_dummies in pandas; see its documentation for details. Another way is to use CatBoost, which can handle categorical features directly, without converting them to a numerical type.
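For example, a rough sketch of both suggestions (all column names, values and parameters below are illustrative, not taken from your data):

```python
import pandas as pd
from catboost import CatBoostClassifier

# Hypothetical DataFrame mixing nominal and continuous columns.
df = pd.DataFrame({
    "brand": ["RENAULT", "PEUGEOT", "RENAULT"],
    "fuel":  ["Diesel", "Essence", "Essence"],
    "price": [668.77, 247.97, 100.0],
    "target": [1, 0, 1],
})
y = df["target"]

# Option 1: one-hot encode only the nominal columns with get_dummies.
# sparse=True keeps the expanded columns as sparse arrays, which helps
# with memory on high-cardinality data.
X = pd.get_dummies(df.drop(columns="target"),
                   columns=["brand", "fuel"], sparse=True)

# Option 2: let CatBoost consume the categorical columns as-is.
model = CatBoostClassifier(iterations=50, verbose=False)
model.fit(df[["brand", "fuel", "price"]], y,
          cat_features=["brand", "fuel"])
```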

Erik answered Nov 14 '22