 

How to transform a categorical variable in Spark into a set of columns coded as {0,1}?

I'm trying to perform a logistic regression (LogisticRegressionWithLBFGS) with Spark MLlib (in Scala) on a dataset that contains categorical variables. I discovered that Spark was not able to work with that kind of variable.

In R there is a simple way to deal with this kind of problem: I transform the variable into a factor (categories), and R creates a set of columns coded as {0,1} indicator variables.

How can I perform this with Spark?

asked May 07 '15 by SparkUser



1 Answer

Using VectorIndexer, you can tell the indexer, via the setMaxCategories() method, the maximum number of distinct values (cardinality) a field may have in order to be considered categorical.

import org.apache.spark.ml.feature.VectorIndexer

// Treat any feature with at most 10 distinct values as categorical.
val indexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexed")
  .setMaxCategories(10)
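
As a usage sketch (assuming a DataFrame named data with a vector column "features"; both names are hypothetical):

val indexerModel = indexer.fit(data)            // scans the data to decide which features are categorical
val indexedData = indexerModel.transform(data)  // adds the "indexed" column with re-indexed category values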

From Scaladocs:

Class for indexing categorical feature columns in a dataset of Vector.

This has 2 usage modes:

Automatically identify categorical features (default behavior)

This helps process a dataset of unknown vectors into a dataset with some continuous features and some categorical features. The choice between continuous and categorical is based upon a maxCategories parameter.

Set maxCategories to the maximum number of categories any categorical feature should have.

E.g.: Feature 0 has unique values {-1.0, 0.0}, and feature 1 values {1.0, 3.0, 5.0}. If maxCategories = 2, then feature 0 will be declared categorical and use indices {0, 1}, and feature 1 will be declared continuous.
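
A minimal sketch of that exact example (assuming an existing SparkSession named spark):

import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.linalg.Vectors

// Feature 0 takes values {-1.0, 0.0}; feature 1 takes values {1.0, 3.0, 5.0}.
val data = Seq(
  Vectors.dense(-1.0, 1.0),
  Vectors.dense(0.0, 3.0),
  Vectors.dense(0.0, 5.0)
).map(Tuple1.apply)

val df = spark.createDataFrame(data).toDF("features")

val model = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexed")
  .setMaxCategories(2)
  .fit(df)

// Feature 0 (2 distinct values <= maxCategories) is declared categorical
// and re-indexed to {0, 1}; feature 1 (3 distinct values) stays continuous.
model.transform(df).show(truncate = false)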

I find this a convenient (though coarse-grained) way to extract the categorical features, but beware of cases where a low-arity field should stay continuous (e.g. age of college students vs. country of origin or US state).
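
Note that VectorIndexer produces category indices, not the R-style {0,1} indicator columns the question asks about. If you want actual dummy columns, a StringIndexer followed by a OneHotEncoder is the closer analogue. A hedged sketch (assuming Spark 3.x, where OneHotEncoder is an estimator, and a hypothetical DataFrame df with a string column "country"):

import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// First map the string categories to numeric indices.
val countryIndexer = new StringIndexer()
  .setInputCol("country")
  .setOutputCol("countryIndex")

// Then expand each index into a sparse {0,1} indicator vector.
val oneHot = new OneHotEncoder()
  .setInputCol("countryIndex")
  .setOutputCol("countryVec")

val indexed = countryIndexer.fit(df).transform(df)
val encoded = oneHot.fit(indexed).transform(indexed)

The resulting "countryVec" column can then be assembled with the other features and fed to the logistic regression.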

answered Sep 16 '22 by xmar