 

strings as features in decision tree/random forest

I am new to machine learning!

Right now I am working on some problems that apply decision trees/random forests. I am trying to fit a model to a problem that has both numbers and strings (such as country name) as features. The library, scikit-learn, takes only numbers as input, but I want to include the strings as well, since they carry a significant amount of information.

How do I handle this scenario? I could convert the strings to numbers by some mechanism such as hashing in Python, but I would like to know the best practice for handling strings in decision tree problems.

user3001408 asked Sep 15 '25 22:09


1 Answer

1) How to add "strings" as features.

Very few algorithms can natively handle strings in any form, and decision trees are not one of them. You have to convert them to something that the decision tree knows about (generally numeric or categorical variables).

How to convert them to features: This very much depends on the nature of the strings. If the strings are sentences, you can use things like bag of words to map each word to a numeric feature. There are numerous different strategies for determining what numeric value to use, but just using 0/1 for not present / present is often a decent baseline.
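As a rough illustration of the 0/1 bag-of-words baseline, here is a minimal sketch using scikit-learn's `CountVectorizer` with `binary=True`. The sentences and labels are made-up toy data, and the variable names are only for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: short sentences with made-up binary labels.
sentences = [
    "the service was great",
    "great food and great service",
    "the food was terrible",
    "terrible service",
]
labels = [1, 1, 0, 0]

# binary=True stores 1 if a word is present and 0 if not,
# instead of the number of times it occurs.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(sentences)  # sparse matrix, one column per word

# Mapping from each word to its column index in X.
vocabulary = vectorizer.vocabulary_

# The tree only ever sees the numeric 0/1 columns.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, labels)

# New text must go through the same fitted vectorizer.
prediction = clf.predict(vectorizer.transform(["great service"]))
```

Words that never appeared in the training sentences are simply dropped when new text is transformed, so the feature columns stay the same.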

For countries, bag of words doesn't make sense, because it represents the feature in the wrong way. A country is more akin to a categorical variable: there is a fixed set X of countries, and every value must be a member of X (this may not be strictly true, but that's beside the point). scikit-learn's decision trees and random forests don't support categorical variables natively. You can "fake" it with a one-hot encoding, but it likely will not work quite as well as a library that fully supports categorical variables.
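One way the one-hot workaround could look in scikit-learn is sketched below. The column names (`country`, `age`), the data, and the labels are all hypothetical; the point is only that the string column is expanded into one 0/1 column per country while the numeric column passes through unchanged:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one string feature and one numeric feature.
df = pd.DataFrame({
    "country": ["France", "Japan", "Brazil", "France", "Japan", "Brazil"],
    "age": [25, 40, 31, 52, 38, 29],
})
y = [0, 1, 0, 0, 1, 0]

# One-hot encode "country"; leave every other column as it is.
# handle_unknown="ignore" encodes unseen countries as all zeros
# instead of raising an error at prediction time.
preprocess = ColumnTransformer(
    [("country", OneHotEncoder(handle_unknown="ignore"), ["country"])],
    remainder="passthrough",
)

model = make_pipeline(
    preprocess,
    RandomForestClassifier(n_estimators=50, random_state=0),
)
model.fit(df, y)

# Inspect what the forest actually receives: 3 country columns + age.
encoded = model[0].transform(df)

# A country that was not in the training data still works.
new_rows = pd.DataFrame({"country": ["Canada"], "age": [33]})
prediction = model.predict(new_rows)
```

Keeping the encoder inside the pipeline means the same country-to-column mapping is reused at prediction time. With many distinct countries the number of columns grows accordingly, which is part of why this is only a stand-in for real categorical support.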

Note that just because countries can be represented as categories doesn't mean that it is the best way to handle them. It depends highly on what your data is and what you are doing. No one can answer it for you without knowing all the details.

Raff.Edward answered Sep 17 '25 18:09