I have a data set containing both categorical and numerical columns, and my target column is also categorical. I am using scikit-learn with Python 3.4. I know that scikit-learn needs all categorical values to be transformed to numerical values before applying any machine learning method.
How should I transform my categorical columns to numerical values? I tried a lot of things, but I keep getting errors such as "'str' object has no ..." and "'numpy.ndarray' object has no attribute 'items'".
Here is an example of my data:
UserID LocationID AmountPaid ServiceID Target
29876 IS345 23.9876 FRDG JFD
29877 IS712 135.98 WERS KOI
My dataset is saved in a CSV file, here is the little code I wrote to give you an idea about what I want to do:
import pandas as pd

# read the CSV file
data_dir = 'C:/Users/davtalab/Desktop/data/'
train_file = data_dir + 'train.csv'
train = pd.read_csv(train_file)
# numeric columns
x_numeric_cols = train['AmountPaid']
# categorical columns (a list of three names, not one concatenated string)
categorical_cols = ['UserID', 'LocationID', 'ServiceID']
# .values replaces as_matrix(), which newer pandas versions no longer provide
x_cat_cols = train[categorical_cols].values
y_target = train['Target'].values
I need x_cat_cols to be converted to numeric values and then added to x_numeric_cols, so that I have my complete input (x) values.
Then I need to convert my target column into numeric values as well and use that as my final target (y) column.
Then I want to do a Random Forest using these two complete sets as:
from sklearn.ensemble import RandomForestClassifier as RF
rf = RF(n_estimators=n_trees, max_features=max_features, verbose=verbose, n_jobs=n_jobs)
rf.fit(x_train, y_train)
Thanks for your help!
Categorical variables represent types of data which may be divided into groups. Examples of categorical variables are race, sex, age group, and educational level.
Categorical variables take values from a distinct, limited set of groups or categories, unlike continuous variables, which take values over a range. They are sometimes loosely called discrete variables.
A categorical variable (sometimes called a nominal variable) is one that has two or more categories, but there is no intrinsic ordering to the categories.
For the target, you can use scikit-learn's LabelEncoder. It gives you a converter from string labels to numeric ones (and also the reverse mapping). The scikit-learn documentation for LabelEncoder includes a usage example.
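As a minimal sketch of the target encoding, assuming a toy target list built from the two sample values in the question (the names y_raw and label_encoder are only illustrative):

```python
from sklearn.preprocessing import LabelEncoder

# Toy target values based on the sample rows in the question (one repeated).
y_raw = ["JFD", "KOI", "JFD"]

label_encoder = LabelEncoder()

# fit_transform learns the sorted set of classes and returns integer codes.
y = label_encoder.fit_transform(y_raw)

# inverse_transform maps the integer codes back to the original strings,
# which is handy for turning model predictions back into labels.
y_back = label_encoder.inverse_transform(y)

print(list(label_encoder.classes_))
print(list(y))
print(list(y_back))
```

Keeping the fitted encoder around lets you decode whatever the classifier predicts later on.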
As for the features, learning algorithms in general treat numeric inputs as ordered quantities, so plain integer codes would impose an arbitrary order on the categories. The usual option is therefore to use OneHotEncoder to convert the categorical features. This generates a new binary feature for each category, denoting whether that category is on or off for a given row. Again, the scikit-learn documentation includes a usage example.
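A rough end-to-end sketch of this approach follows. It assumes a reasonably recent scikit-learn (0.20 or later, where ColumnTransformer is available) and uses a small made-up DataFrame in place of train.csv; the extra rows and the names preprocess, feature_cols, and model are illustrative, not from the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Made-up stand-in for train.csv, following the column layout in the question.
train = pd.DataFrame({
    "UserID": [29876, 29877, 29876, 29878],
    "LocationID": ["IS345", "IS712", "IS345", "IS712"],
    "AmountPaid": [23.9876, 135.98, 40.0, 99.5],
    "ServiceID": ["FRDG", "WERS", "FRDG", "WERS"],
    "Target": ["JFD", "KOI", "JFD", "KOI"],
})

categorical_cols = ["UserID", "LocationID", "ServiceID"]
numeric_cols = ["AmountPaid"]
feature_cols = categorical_cols + numeric_cols

# One-hot encode the categorical columns; pass the numeric column through as-is.
# handle_unknown="ignore" avoids errors on categories first seen at predict time.
preprocess = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    remainder="passthrough",
)

# 3 distinct UserIDs + 2 LocationIDs + 2 ServiceIDs + 1 numeric column = 8 columns.
x_encoded = preprocess.fit_transform(train[feature_cols])
print(x_encoded.shape)

# Encode the string target as integers.
y = LabelEncoder().fit_transform(train["Target"])

# Chain the preprocessing and the forest so both are fitted together.
model = make_pipeline(
    preprocess,
    RandomForestClassifier(n_estimators=10, random_state=0),
)
model.fit(train[feature_cols], y)
pred = model.predict(train[feature_cols])
```

Wrapping the encoder and the classifier in one pipeline means the same encoding is applied automatically to any new data passed to predict. Note that one-hot encoding an identifier column such as UserID can produce a very large number of columns on real data.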