I'm currently trying to train a dataset with a decision tree classifier, but I can't get train_test_split to work.
In the code below, CS is the target output and EN, SN, JT, FT, PW, YR, LO, and LA are the input features.
All variables that went through OHL are in sparse matrix format, whereas the others are NumPy arrays taken straight from the dataframe.
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

def OHL(x, column):  # label-encode a column, then one-hot encode it (returns a sparse matrix)
    le = LabelEncoder()
    enc = OneHotEncoder()
    Labeled = le.fit_transform(x[column].astype(str))
    return enc.fit_transform(Labeled.reshape(-1, 1))
###------------------------------------------------------------------------
df = pd.read_csv('h1b_kaggle.csv')
df = df.drop(['Unnamed: 0', 'WORKSITE'], axis=1)
###------------------------------------------------------------------------
CS = OHL(df, 'CASE_STATUS')
EN = OHL(df, 'EMPLOYER_NAME')
SN = OHL(df, 'SOC_NAME')
JT = OHL(df, 'JOB_TITLE')
FT = OHL(df, 'FULL_TIME_POSITION')
PW = np.array(df['PREVAILING_WAGE'])
YR = OHL(df, 'YEAR')
LO = np.array(df['lon'])
LA = np.array(df['lat'])
test_size defines the size of the test set, either as a fraction of the dataset (a float between 0 and 1) or as an absolute number of samples (an int). It is the counterpart of train_size; you should provide one or the other. If neither is given, the default share of the dataset used for testing is 0.25, or 25 percent.
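For example (a minimal sketch with made-up toy data, just to show the parameter):

from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features (toy data)
y = np.arange(10)

# 25% of the rows go to the test set; random_state makes the shuffle reproducible
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
print(X_tr.shape, X_te.shape)  # (7, 2) (3, 2)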
If you look at sklearn.model_selection.train_test_split, you can see it takes an *arrays argument. To split the first three of your arrays, therefore, you could use
CS_tr, CS_te, EN_tr, EN_te, SN_tr, SN_te = train_test_split(CS, EN, SN)
(of course, you can pass more arrays than that).
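For instance, to split all nine of your arrays in one call (variable names as in your question; test_size and random_state are optional and shown only for illustration):

(CS_tr, CS_te, EN_tr, EN_te, SN_tr, SN_te,
 JT_tr, JT_te, FT_tr, FT_te, PW_tr, PW_te,
 YR_tr, YR_te, LO_tr, LO_te, LA_tr, LA_te) = train_test_split(
    CS, EN, SN, JT, FT, PW, YR, LO, LA,
    test_size=0.25, random_state=0)

Each input produces a train/test pair, returned in the same order as the inputs, and all inputs must have the same number of rows.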
Note that current versions of sklearn return sparse matrices when given sparse matrices, so the outputs of OHL stay sparse after splitting.
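If your end goal is the decision tree itself, one possible way to put the pieces together is sketched below. It assumes you want a single feature matrix (scipy.sparse.hstack is my choice here, not something from your code), that the target is kept as plain labels rather than one-hot (DecisionTreeClassifier expects a 1-D y), and that the numeric columns contain no NaNs (the real h1b data may need dropping or imputing first):

import scipy.sparse as sp
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stack the sparse one-hot blocks and the dense numeric columns into one sparse matrix
X = sp.hstack([EN, SN, JT, FT, YR,
               sp.csr_matrix(PW.reshape(-1, 1)),
               sp.csr_matrix(LO.reshape(-1, 1)),
               sp.csr_matrix(LA.reshape(-1, 1))]).tocsr()

# Use the raw labels as the target instead of the one-hot CS matrix
y = df['CASE_STATUS'].astype(str).values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = DecisionTreeClassifier()
clf.fit(X_tr, y_tr)           # sklearn trees accept sparse CSR/CSC input
print(clf.score(X_te, y_te))  # accuracy on the held-out 25%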