Changing the type of values in arrays resulting from sklearn.model_selection.train_test_split

Question

I'm doing this tutorial on machine learning in which the following code is used:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('breast-cancer-wisconsin.data.csv')
# Missing values are marked with '?' in the file; replace them with a sentinel
df.replace('?', -99999, inplace=True)
df.drop(columns=['id'], inplace=True)
X = np.array(df.drop(columns=['class']))
y = np.array(df['class'])

# train_test_split returns the splits in this order
X_train, X_test, y_train, y_test = train_test_split(X, y)

Here is a sample from the csv file:

    id,clump_thickness,unif_cell_size,unif_cell_shape,marg_adhesion,single_epith_cell_size,bare_nuclei,bland_chrom,norm_nucleoli,mitoses,class
    1000025,5,1,1,1,2,1,3,1,1,2
    1002945,5,4,4,5,7,10,3,2,1,2
    1015425,3,1,1,1,2,2,3,1,1,2
    1016277,6,8,8,1,3,4,3,7,1,2
    1017023,4,1,1,3,2,1,3,1,1,2
    1017122,8,10,10,8,7,10,9,7,1,4
    1018099,1,1,1,1,2,10,3,1,1,2
    1018561,2,1,2,1,2,1,3,1,1,2
    1033078,2,1,1,1,2,1,1,1,5,2
    1033078,4,2,1,1,2,1,2,1,1,2
    1035283,1,1,1,1,1,1,3,1,1,2
    1036172,2,1,1,1,2,1,2,1,1,2
    1041801,5,3,3,3,2,3,4,4,1,4
    1043999,1,1,1,1,2,3,3,1,1,2
    1044572,8,7,5,10,7,9,5,5,4,4
    1047630,7,4,6,4,6,1,4,3,1,4
    1048672,4,1,1,1,2,1,2,1,1,2
    1049815,4,1,1,1,2,1,3,1,1,2
    1050670,10,7,7,6,4,10,4,1,2,4
    1050718,6,1,1,1,2,1,3,1,1,2
    1054590,7,3,2,10,5,10,5,4,4,4
    1054593,10,5,5,3,6,7,7,10,1,4
    1056784,3,1,1,1,2,1,2,1,1,2
    1057013,8,4,5,1,2,?,7,3,1,4
    1059552,1,1,1,1,2,1,3,1,1,2
    1065726,5,2,3,4,2,7,3,6,1,4
    1066373,3,2,1,1,1,1,2,1,1,2

When looking at the results from sklearn.model_selection.train_test_split I found out something weird (at least to me). If I run

    print(type(y_test[0]))
    print()
    print(type(X_train[:,1][0]))

I get the following output:

<class 'numpy.int64'>
<class 'int'>

Somehow the values in X_train are of the type int and the values in y_test are of the type numpy.int64. I don't know why train_test_split does this - I don't think it has to do with the data that is being split up - and the documentation doesn't seem to mention it either.

Since I want the values in y_test to be regular integers as well, I tried changing the type of y_test with astype(). Unfortunately, the following code

y_test = y_test.astype(int)
print(type(y_test[0]))

returns

<class 'numpy.int64'>

Question: Why does train_test_split return arrays containing values with different kinds of datatypes? Why am I not able to convert the values in y_test to integers?

Edit: The difference in type is caused by the data. If I run

 print(type(X[:,1][0]))
 print(type(y[0]))

I get

<class 'int'>
<class 'numpy.int64'>
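
A minimal sketch of how such a difference can come from the data alone, without train_test_split being involved. The arrays here are made-up stand-ins, on the assumption that the '?' entries force at least one feature column to hold strings, so that X ends up with dtype=object while y stays a plain integer array:

```python
import numpy as np

# Hypothetical stand-ins for the arrays in the question.
# A purely numeric column becomes a fixed-width integer array ...
y = np.array([2, 2, 4])

# ... while an array that also has to hold strings (such as '?' or '10'
# read from the CSV) falls back to dtype=object and stores Python objects.
X = np.array([[5, "1"], [8, "?"], [1, "10"]], dtype=object)

print(y.dtype, type(y[0]))        # integer dtype, elements are NumPy scalars
print(X.dtype, type(X[:, 0][0]))  # object dtype, elements are Python objects
```

Checking X.dtype and y.dtype on the real arrays would show whether this is what happens with the actual file.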

I still would like to know why astype doesn't work though!:)

STJ · Accepted Answer

To convert NumPy values to native Python types, call item() on each element (see numpy.ndarray.item; NumPy scalars such as numpy.int64 provide the same method):

y_test_int = [v.item() for v in y_test]
print(type(y_test_int[0]))
#<class 'int'>
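
As for why astype(int) appears to do nothing: it returns another NumPy array with an integer dtype, and indexing such an array always yields a NumPy scalar, not a built-in int. The sketch below uses a small made-up y_test to compare astype with two alternatives, tolist() and astype(object):

```python
import numpy as np

y_test = np.array([2, 4, 2], dtype=np.int64)  # hypothetical labels

# astype(int) returns another ndarray; its elements are still NumPy scalars.
as_int = y_test.astype(int)
print(type(as_int[0]))

# tolist() converts every element to the matching built-in Python type.
as_list = y_test.tolist()
print(type(as_list[0]))

# astype(object) keeps an ndarray but stores plain Python ints in it.
as_obj = y_test.astype(object)
print(type(as_obj[0]))
```

tolist() gives the same result as the list comprehension with item(), while astype(object) is an option if the values need to stay in an array, at the cost of losing fast vectorized arithmetic.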

Tags: python, arrays, python-3.x, scikit-learn

Mr. President
