How can I change the datatype to float64 so that sklearn can work on a DataFrame whose values exceed the np.float32 maximum?

In my data set there are a few values (e.g. 1.4619664882428694e+258) that are greater than the float32 maximum (3.4028235e+38). Now, when fitting the model, I get the error below:

Input contains NaN, infinity or a value too large for dtype('float32').

I tried the code below:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import r2_score

df_features = pd.read_csv('data\df_features.csv')
df_target = pd.read_csv('data\df_target.csv')

X_train, X_test, y_train, y_test = train_test_split(df_features, df_target, test_size=.25, random_state=0)

model = AdaBoostRegressor()

try:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = r2_score(y_test, y_pred)
    print(acc)

except Exception as error:
    print(error)

How can I solve this problem if I want to use the real data without normalizing it? Is there an option to set the default data type to float64 for sklearn? If so, how?

asked Nov 14 '19 by BC Smith

1 Answer

This might not be a direct answer to the question, but I think that for practical purposes it should be addressed as a data science question.

First, a value of 1.4e258 feels rather suspicious, as it is hard to picture where it could have a meaningful physical interpretation. Such extreme values can badly skew your metrics and your model. The real question is whether or not it is an outlier, and the answer depends on your data and what they mean.

  • If it is an outlier (as in an extreme value), the right approach might be to remove the instance entirely; the sketch after this list shows a simple cutoff-based filter. This will probably improve the performance of the trained model on the remaining instances. The downside is that the model won't perform well on that instance or on similarly extreme values. Practically, this requires making everyone who uses the model aware of its limitations on those extreme values.

  • If it is not an outlier, you should consider transforming it to make it more informative, both to humans and to the machine. That could mean using a more meaningful scale, such as a logarithmic one: it is easier for humans to work with and may avoid numerical problems. Another approach is some form of renormalization. For example, if all your values lie between 1e250 and 1e260, you could divide them by 1e255. If the variable that takes such values depends on another one, you might renormalize by that variable or one of its powers: if it is a volume, for instance, you might renormalize by a size variable raised to the 3rd power. That can help both to avoid sklearn's numerical problems and to build more meaningful models; the sketch after this list also illustrates a log transform and a simple renormalization.
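Here is a minimal sketch of both options. The column name big_col and the use of the float32 maximum as a cutoff are assumptions made for illustration, not something given in the question; adjust them to your own data.

import numpy as np

# Assumes df_features is the DataFrame loaded in the question and that the
# extreme values live in a hypothetical column named 'big_col'.

# Option 1: treat the extreme values as outliers and drop those rows.
cutoff = np.finfo(np.float32).max              # ~3.4e38, the limit sklearn complains about
df_trimmed = df_features[df_features['big_col'].abs() < cutoff]

# Option 2: keep the rows but move the column onto a more workable scale.
# A log transform compresses 1e250..1e260 down to roughly 250..260
# (only valid if the values are strictly positive).
df_log = df_features.copy()
df_log['big_col'] = np.log10(df_log['big_col'])

# Or renormalize by a reference quantity, e.g. divide by 1e255 (or by a
# size variable cubed if 'big_col' represents a volume).
df_scaled = df_features.copy()
df_scaled['big_col'] = df_scaled['big_col'] / 1e255

Whichever transformed DataFrame you pick can then be passed to train_test_split and fitted exactly as in the question.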

answered Oct 09 '22 by lcrmorin