 

How to transform a categorical variable in Spark into a set of columns coded as {0,1}?

I'm trying to perform a logistic regression (LogisticRegressionWithLBFGS) with Spark MLlib (in Scala) on a dataset that contains categorical variables. I discovered that Spark was not able to work with that kind of variable.

In R there is a simple way to deal with this kind of problem: I transform the variable into a factor (categories), and R creates a set of columns coded as {0,1} indicator variables.

How can I perform this with Spark?

asked May 07 '15 by SparkUser



1 Answer

Using VectorIndexer, you can tell the indexer, via the setMaxCategories() method, the maximum number of distinct values (cardinality) a field may have in order to be considered categorical.

import org.apache.spark.ml.feature.VectorIndexer

// Treat any feature with at most 10 distinct values as categorical.
val indexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexed")
  .setMaxCategories(10)
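
As a usage sketch (assuming a DataFrame named data with a vector column "features"; both names are hypothetical):

val indexerModel = indexer.fit(data)            // scans the data to decide which features are categorical
val indexedData = indexerModel.transform(data)  // adds the "indexed" column with re-indexed category values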

From Scaladocs:

Class for indexing categorical feature columns in a dataset of Vector.

This has 2 usage modes:

Automatically identify categorical features (default behavior)

This helps process a dataset of unknown vectors into a dataset with some continuous features and some categorical features. The choice between continuous and categorical is based upon a maxCategories parameter.

Set maxCategories to the maximum number of categories any categorical feature should have.

E.g.: Feature 0 has unique values {-1.0, 0.0}, and feature 1 values {1.0, 3.0, 5.0}. If maxCategories = 2, then feature 0 will be declared categorical and use indices {0, 1}, and feature 1 will be declared continuous.
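
A minimal sketch of that exact example (assuming an existing SparkSession named spark):

import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.linalg.Vectors

// Feature 0 takes values {-1.0, 0.0}; feature 1 takes values {1.0, 3.0, 5.0}.
val data = Seq(
  Vectors.dense(-1.0, 1.0),
  Vectors.dense(0.0, 3.0),
  Vectors.dense(0.0, 5.0)
).map(Tuple1.apply)

val df = spark.createDataFrame(data).toDF("features")

val model = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexed")
  .setMaxCategories(2)
  .fit(df)

// Feature 0 (2 distinct values <= maxCategories) is declared categorical
// and re-indexed to {0, 1}; feature 1 (3 distinct values) stays continuous.
model.transform(df).show(truncate = false)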

I find this a convenient (though coarse-grained) way to extract the categorical features, but beware of cases where a low-arity field should stay continuous (e.g. age of college students vs. country of origin or US state).
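
Note that VectorIndexer produces category indices, not the R-style {0,1} indicator columns the question asks about. If you want actual dummy columns, a StringIndexer followed by a OneHotEncoder is the closer analogue. A hedged sketch (assuming Spark 3.x, where OneHotEncoder is an estimator, and a hypothetical DataFrame df with a string column "country"):

import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// First map the string categories to numeric indices.
val countryIndexer = new StringIndexer()
  .setInputCol("country")
  .setOutputCol("countryIndex")

// Then expand each index into a sparse {0,1} indicator vector.
val oneHot = new OneHotEncoder()
  .setInputCol("countryIndex")
  .setOutputCol("countryVec")

val indexed = countryIndexer.fit(df).transform(df)
val encoded = oneHot.fit(indexed).transform(indexed)

The resulting "countryVec" column can then be assembled with the other features and fed to the logistic regression.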

answered Sep 16 '22 by xmar