
Changing the type of values in arrays resulting from sklearn.model_selection.train_test_split

I'm doing this tutorial on machine learning in which the following code is used:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('breast-cancer-wisconsin.data.csv')
df.replace('?', -99999, inplace=True)
df.drop(columns=['id'], inplace=True)
X = np.array(df.drop(columns=['class']))
y = np.array(df['class'])

# train_test_split returns the splits in the order X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y)

Here is a sample from the csv file:

    id,clump_thickness,unif_cell_size,unif_cell_shape,marg_adhesion,single_epith_cell_size,bare_nuclei,bland_chrom,norm_nucleoli,mitoses,class
    1000025,5,1,1,1,2,1,3,1,1,2
    1002945,5,4,4,5,7,10,3,2,1,2
    1015425,3,1,1,1,2,2,3,1,1,2
    1016277,6,8,8,1,3,4,3,7,1,2
    1017023,4,1,1,3,2,1,3,1,1,2
    1017122,8,10,10,8,7,10,9,7,1,4
    1018099,1,1,1,1,2,10,3,1,1,2
    1018561,2,1,2,1,2,1,3,1,1,2
    1033078,2,1,1,1,2,1,1,1,5,2
    1033078,4,2,1,1,2,1,2,1,1,2
    1035283,1,1,1,1,1,1,3,1,1,2
    1036172,2,1,1,1,2,1,2,1,1,2
    1041801,5,3,3,3,2,3,4,4,1,4
    1043999,1,1,1,1,2,3,3,1,1,2
    1044572,8,7,5,10,7,9,5,5,4,4
    1047630,7,4,6,4,6,1,4,3,1,4
    1048672,4,1,1,1,2,1,2,1,1,2
    1049815,4,1,1,1,2,1,3,1,1,2
    1050670,10,7,7,6,4,10,4,1,2,4
    1050718,6,1,1,1,2,1,3,1,1,2
    1054590,7,3,2,10,5,10,5,4,4,4
    1054593,10,5,5,3,6,7,7,10,1,4
    1056784,3,1,1,1,2,1,2,1,1,2
    1057013,8,4,5,1,2,?,7,3,1,4
    1059552,1,1,1,1,2,1,3,1,1,2
    1065726,5,2,3,4,2,7,3,6,1,4
    1066373,3,2,1,1,1,1,2,1,1,2

When looking at the results from sklearn.model_selection.train_test_split I found out something weird (at least to me). If I run

    print(type(y_test[0]))
    print()
    print(type(X_train[:,1][0]))

I get the following output:

<class 'numpy.int64'>
<class 'int'>

Somehow the values in X_train are of the type int and the values in y_test are of the type numpy.int64. I don't know why train_test_split does this - I don't think it has to do with the data that is being split up - and the documentation doesn't seem to mention it either.

Since I want the values in y_test to be regular integers as well, I tried changing the type of y_test with astype(). Unfortunately, the following code

y_test = y_test.astype(int)
print(type(y_test[0]))

returns

<class 'numpy.int64'>

Question: Why does train_test_split return arrays containing values with different kinds of datatypes? Why am I not able to convert the values in y_test to integers?

Edit: The difference in type is caused by the data. If I run

 print(type(X[:,1][0]))
 print(type(y[0])) 

I get

<class 'int'>
<class 'numpy.int64'>
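
A minimal sketch of how the data can cause this, assuming the '?' entries are what makes the difference: a column containing '?' is read as text, so the combined feature array gets dtype object and holds plain Python objects, while the all-integer class column stays int64. The three-row `csv_text` and the name `X_num` are made up for illustration, and the `pd.to_numeric` step is one possible fix, not part of the tutorial.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for the real CSV; only the '?' entry matters.
csv_text = """clump_thickness,bare_nuclei,class
5,1,2
8,?,4
1,10,2
"""

df = pd.read_csv(io.StringIO(csv_text))
df = df.replace('?', -99999)

# bare_nuclei was parsed as text because of the '?', so it is not an int column.
print(df.dtypes)

X = np.array(df.drop(columns=['class']))
y = np.array(df['class'])

print(X.dtype)  # object
print(y.dtype)  # int64
print(type(X[0, 0]))  # the question reports <class 'int'> for this case
print(type(y[0]))

# Possible fix (an assumption, not from the tutorial): make the column numeric
# before building the array, so X gets an integer dtype as well.
df['bare_nuclei'] = pd.to_numeric(df['bare_nuclei'])
X_num = df.drop(columns=['class']).to_numpy()
print(X_num.dtype)
```

With a numeric dtype on both arrays, X and y would then yield the same kind of NumPy scalar when indexed.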

I still would like to know why astype doesn't work though!:)

Mr. President asked Oct 18 '18 12:10


1 Answer

To convert NumPy values to native Python types, use item() (numpy.ndarray.item; NumPy scalars such as numpy.int64 provide the same method):

y_test_int = [v.item() for v in y_test]
print(type(y_test_int[0]))
#<class 'int'>
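
As for why astype(int) appears to do nothing, here is a short sketch rather than a definitive explanation. As far as I know, NumPy treats the Python type int as shorthand for its own default integer dtype (int64 on most 64-bit platforms), and indexing a numeric array always gives back a NumPy scalar. The three-element `y_test` below is a made-up stand-in for the real labels.

```python
import numpy as np

# Hypothetical stand-in for the real y_test from the question.
y_test = np.array([2, 4, 2], dtype=np.int64)

# astype(int) picks a NumPy integer dtype, so elements stay NumPy scalars.
converted = y_test.astype(int)
print(converted.dtype)
print(type(converted[0]))

# tolist() converts every element to a native Python type in one call.
y_list = y_test.tolist()
print(type(y_list[0]))  # <class 'int'>

# An object-dtype array stores Python ints, at the cost of slower arithmetic.
y_obj = y_test.astype(object)
print(type(y_obj[0]))  # <class 'int'>
```

tolist() is usually the simplest choice when a plain list of Python ints is enough. The object-dtype variant keeps the array interface but gives up most of NumPy's speed.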

STJ answered Oct 04 '22 02:10