I'm following a machine learning tutorial that uses the following code:
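import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv('breast-cancer-wisconsin.data.csv')
df.replace('?', -99999, inplace=True)  # mark missing values with a sentinel
df.drop(columns=['id'], inplace=True)  # drop the id column
X = np.array(df.drop(columns=['class']))  # features
y = np.array(df['class'])  # labels
# train_test_split returns X_train, X_test, y_train, y_test, in that order
X_train, X_test, y_train, y_test = train_test_split(X, y)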
Here is a sample from the csv file:
id,clump_thickness,unif_cell_size,unif_cell_shape,marg_adhesion,single_epith_cell_size,bare_nuclei,bland_chrom,norm_nucleoli,mitoses,class
1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1015425,3,1,1,1,2,2,3,1,1,2
1016277,6,8,8,1,3,4,3,7,1,2
1017023,4,1,1,3,2,1,3,1,1,2
1017122,8,10,10,8,7,10,9,7,1,4
1018099,1,1,1,1,2,10,3,1,1,2
1018561,2,1,2,1,2,1,3,1,1,2
1033078,2,1,1,1,2,1,1,1,5,2
1033078,4,2,1,1,2,1,2,1,1,2
1035283,1,1,1,1,1,1,3,1,1,2
1036172,2,1,1,1,2,1,2,1,1,2
1041801,5,3,3,3,2,3,4,4,1,4
1043999,1,1,1,1,2,3,3,1,1,2
1044572,8,7,5,10,7,9,5,5,4,4
1047630,7,4,6,4,6,1,4,3,1,4
1048672,4,1,1,1,2,1,2,1,1,2
1049815,4,1,1,1,2,1,3,1,1,2
1050670,10,7,7,6,4,10,4,1,2,4
1050718,6,1,1,1,2,1,3,1,1,2
1054590,7,3,2,10,5,10,5,4,4,4
1054593,10,5,5,3,6,7,7,10,1,4
1056784,3,1,1,1,2,1,2,1,1,2
1057013,8,4,5,1,2,?,7,3,1,4
1059552,1,1,1,1,2,1,3,1,1,2
1065726,5,2,3,4,2,7,3,6,1,4
1066373,3,2,1,1,1,1,2,1,1,2
When looking at the results from sklearn.model_selection.train_test_split, I found something that seems odd (at least to me). If I run
print(type(y_test[0]))
print()
print(type(X_train[:,1][0]))
I get the following output:
<class 'numpy.int64'>
<class 'int'>
Somehow the values in X_train are of type int and the values in y_test are of type numpy.int64. I don't know why train_test_split does this. I don't think it has to do with the data that is being split up, and the documentation doesn't seem to mention it either.
Since I want the values in y_test to be regular integers as well, I tried changing the type of y_test with astype(). Unfortunately, the following code
y_test = y_test.astype(int)
print(type(y_test[0]))
returns
<class 'numpy.int64'>
Question: Why does train_test_split return arrays containing values of different datatypes? Why am I not able to convert the values in y_test to integers?
Edit: The difference in type is caused by the data. If I run
print(type(X[:,1][0]))
print(type(y[0]))
I get
<class 'int'>
<class 'numpy.int64'>
I still would like to know why astype doesn't work though!:)
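As a rough illustration of how the data could cause this, here is a minimal sketch. It assumes the cause is the '?' entries, which make pandas read bare_nuclei as strings, so the combined feature array falls back to dtype=object. The two-row DataFrame is a hypothetical stand-in for the real CSV, not the actual file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV: one all-integer column and one column
# that holds strings because it contains '?'.
df = pd.DataFrame({"unif_cell_size": [1, 4], "bare_nuclei": ["1", "?"]})

X = np.array(df)                    # mixed column types collapse to dtype=object
y = np.array(df["unif_cell_size"])  # a single integer column keeps an integer dtype

print(X.dtype, type(X[0, 0]))  # object array, elements are plain Python ints
print(y.dtype, type(y[0]))     # integer array, elements are NumPy scalars
```

Under that assumption, train_test_split is not changing anything: it hands back slices of X and y with the dtypes they already had.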
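astype(int) does work, but it only sets the array's dtype to NumPy's integer type, so indexing the result still returns a NumPy scalar such as numpy.int64. To convert NumPy values to Python types, there's numpy.ndarray.item: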
y_test_int = [v.item() for v in y_test]
print(type(y_test_int[0]))
#<class 'int'>
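For comparison, here is a small sketch of the options side by side. The three-element y_test is a made-up stand-in for the real labels, and the variable names are only for illustration.

```python
import numpy as np

y_test = np.array([2, 4, 2])  # hypothetical stand-in for the real labels

converted = y_test.astype(int)      # dtype stays a NumPy integer type
print(type(converted[0]))           # still a NumPy scalar, not a Python int

as_list = y_test.tolist()           # converts every element to a Python int
print(type(as_list[0]))             # <class 'int'>

as_objects = y_test.astype(object)  # an array that stores Python ints
print(type(as_objects[0]))          # <class 'int'>
```

tolist() is usually the shortest route when a plain list is acceptable. astype(object) keeps an array, but at the cost of NumPy's fast numeric operations, so it is probably only worth it when Python ints are strictly required.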