
train_test_split with multiple features

I'm trying to train a decision tree classifier on a data set, but I can't get train_test_split to work.

In the code below, CS is the target output, and EN, SN, JT, FT, PW, YR, LO and LA are the input features.

Every variable that went through OHL is a sparse matrix, whereas the others are arrays taken straight from the dataframe.

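import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

def OHL(x, column):  # label-encode a column, then one-hot encode it
    le = LabelEncoder()
    enc = OneHotEncoder()
    # Cast to str so mixed or missing values are treated as categories
    labeled = le.fit_transform(x[column].astype(str))
    # OneHotEncoder expects a 2-D input and returns a sparse matrix
    return enc.fit_transform(labeled.reshape(-1, 1))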

###------------------------------------------------------------------------

df = pd.read_csv('h1b_kaggle.csv')
df = df.drop(['Unnamed: 0', 'WORKSITE'], axis=1)

###------------------------------------------------------------------------

CS = OHL(df, 'CASE_STATUS')
EN = OHL(df, 'EMPLOYER_NAME')
SN = OHL(df, 'SOC_NAME')
JT = OHL(df, 'JOB_TITLE')
FT = OHL(df, 'FULL_TIME_POSITION')
PW = np.array(df['PREVAILING_WAGE'])
YR = OHL(df, 'YEAR')
LO = np.array(df['lon'])
LA = np.array(df['lat'])
Ekkasit Smithipanon asked Apr 14 '18 07:04




1 Answer

If you look at sklearn.model_selection.train_test_split, you can see it takes an *arrays argument, so it accepts any number of arrays as long as they all have the same number of rows. To split the first three of your variables, therefore, you could use

CS_tr, CS_te, EN_tr, EN_te, SN_tr, SN_te = train_test_split(CS, EN, SN)

(of course, you can pass more arrays than that).
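As a minimal sketch of passing several arrays at once, the snippet below splits a target, a sparse feature and a dense feature in one call. The toy data is hypothetical and only stands in for CS, EN and PW from the question; test_size and random_state are illustrative choices.

```python
import numpy as np
from scipy import sparse
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the question's variables (8 rows each).
CS = np.array([0, 1, 0, 1, 0, 1, 0, 1])   # target labels
EN = sparse.csr_matrix(np.eye(8))         # sparse one-hot-style feature
PW = np.arange(8, dtype=float) * 1000.0   # dense numeric feature

# Every array is split using the same shuffled row indices,
# so row i of EN_tr still corresponds to CS_tr[i] and PW_tr[i].
CS_tr, CS_te, EN_tr, EN_te, PW_tr, PW_te = train_test_split(
    CS, EN, PW, test_size=0.25, random_state=0
)
```

The outputs come back in pairs (train, test), in the same order as the inputs.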

Note that current versions of sklearn return sparse matrices when given sparse matrices, so the outputs of OHL can be passed in directly without converting them to dense arrays.
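Since the end goal is a decision tree, one possible alternative (an assumption on my part, not something the question's code does) is to stack the feature blocks into a single sparse matrix with scipy.sparse.hstack and split just X and y. The toy data and variable names below are hypothetical.

```python
import numpy as np
from scipy import sparse
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: 8 rows, one sparse block and one dense column.
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
EN = sparse.csr_matrix(np.eye(8))
PW = np.arange(8, dtype=float)

# Convert the dense column to a sparse (n_rows, 1) block, then stack column-wise.
PW_col = sparse.csr_matrix(PW.reshape(-1, 1))
X = sparse.hstack([EN, PW_col], format="csr")

# Sparse input stays sparse after the split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# DecisionTreeClassifier accepts sparse input for both fit and predict.
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)
```

With a single feature matrix, only four values come back from train_test_split, which is usually easier to manage than one train/test pair per feature.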

Ami Tavory answered Oct 28 '22 18:10