I'm currently trying to train a dataset with a decision tree classifier, but I can't get train_test_split to work.
In the code below, CS is the target output and EN, SN, JT, FT, PW, YR, LO, and LA are the input features.
All variables that went through OHL are in sparse matrix format, whereas the others are NumPy arrays taken straight from the dataframe.
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

def OHL(x, column):  # label-encode a column, then one-hot encode it (returns a sparse matrix)
    le = LabelEncoder()
    enc = OneHotEncoder()
    Labeled = le.fit_transform(x[column].astype(str))
    return enc.fit_transform(Labeled.reshape(-1, 1))
###------------------------------------------------------------------------
df = pd.read_csv('h1b_kaggle.csv')
df = df.drop(['Unnamed: 0', 'WORKSITE'], axis=1)
###------------------------------------------------------------------------
CS = OHL(df, 'CASE_STATUS')
EN = OHL(df, 'EMPLOYER_NAME')
SN = OHL(df, 'SOC_NAME')
JT = OHL(df, 'JOB_TITLE')
FT = OHL(df, 'FULL_TIME_POSITION')
PW = np.array(df['PREVAILING_WAGE'])
YR = OHL(df, 'YEAR')
LO = np.array(df['lon'])
LA = np.array(df['lat'])
test_size defines the size of the test set, either as a fraction of the dataset (a float between 0 and 1) or as an absolute number of samples (an int). It is the counterpart of train_size; you should provide one or the other. If neither is given, the default share of the dataset used for testing is 0.25, or 25 percent.
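For example (a minimal sketch with made-up toy data, just to show the parameter):

from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features (toy data)
y = np.arange(10)

# 25% of the rows go to the test set; random_state makes the shuffle reproducible
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
print(X_tr.shape, X_te.shape)  # (7, 2) (3, 2)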
If you look at sklearn.model_selection.train_test_split, you can see it takes an *arrays argument. To split the first three of your arrays, therefore, you could use
CS_tr, CS_te, EN_tr, EN_te, SN_tr, SN_te = train_test_split(CS, EN, SN)
(of course, you can pass more arrays than that).
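For instance, to split all nine of your arrays in one call (variable names as in your question; test_size and random_state are optional and shown only for illustration):

(CS_tr, CS_te, EN_tr, EN_te, SN_tr, SN_te,
 JT_tr, JT_te, FT_tr, FT_te, PW_tr, PW_te,
 YR_tr, YR_te, LO_tr, LO_te, LA_tr, LA_te) = train_test_split(
    CS, EN, SN, JT, FT, PW, YR, LO, LA,
    test_size=0.25, random_state=0)

Each input produces a train/test pair, returned in the same order as the inputs, and all inputs must have the same number of rows.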
Note that current versions of sklearn return sparse matrices when given sparse matrices, so the outputs of OHL stay sparse after splitting.
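If your end goal is the decision tree itself, one possible way to put the pieces together is sketched below. It assumes you want a single feature matrix (scipy.sparse.hstack is my choice here, not something from your code), that the target is kept as plain labels rather than one-hot (DecisionTreeClassifier expects a 1-D y), and that the numeric columns contain no NaNs (the real h1b data may need dropping or imputing first):

import scipy.sparse as sp
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stack the sparse one-hot blocks and the dense numeric columns into one sparse matrix
X = sp.hstack([EN, SN, JT, FT, YR,
               sp.csr_matrix(PW.reshape(-1, 1)),
               sp.csr_matrix(LO.reshape(-1, 1)),
               sp.csr_matrix(LA.reshape(-1, 1))]).tocsr()

# Use the raw labels as the target instead of the one-hot CS matrix
y = df['CASE_STATUS'].astype(str).values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = DecisionTreeClassifier()
clf.fit(X_tr, y_tr)           # sklearn trees accept sparse CSR/CSC input
print(clf.score(X_te, y_te))  # accuracy on the held-out 25%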