Spark ML VectorAssembler() dealing with thousands of columns in dataframe

I was using a Spark ML pipeline to set up classification models on a really wide table. This means that I have to generate all the column-handling code automatically instead of literally typing each column name. I am pretty much a beginner with Scala and Spark. I got stuck at the VectorAssembler() part when trying to do something like the following:

// convert the header RDD into a single string
val featureHeaders = featureHeader.collect.mkString(" ")
val featureArray = featureHeaders.split(",").toArray
val quote = "\""
val featureSIArray = featureArray.map(x => s"$quote$x$quote")
// count the elements in the header
val featureHeader_cnt = featureHeaders.split(",").toList.length
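One thing worth noting about the snippet above: `VectorAssembler.setInputCols` matches column names literally against the DataFrame schema, so wrapping each name in `"` characters will produce names that do not exist in the schema. A minimal sketch of the same parsing without the quoting, using a hypothetical header line:

```scala
// Hypothetical comma-separated header line, standing in for the
// string collected from the header RDD
val headerLine = "id,target,f1,f2,f3"

// Split into plain column names -- no surrounding quotes needed,
// since VectorAssembler matches names exactly as they appear in the schema
val featureArray = headerLine.split(",").map(_.trim)

// Number of columns in the header
val featureHeaderCnt = featureArray.length
```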


// Fit on whole dataset to include all labels in index.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

val labelIndexer = new StringIndexer().
  setInputCol("target").
  setOutputCol("indexedLabel")

val featureAssembler = new VectorAssembler().
  setInputCols(featureSIArray).
  setOutputCol("features")

val convpipeline = new Pipeline().
  setStages(Array(labelIndexer, featureAssembler))

val myFeatureTransfer = convpipeline.fit(df)

Apparently it didn't work. I am not sure what I should do to make the whole thing more automatic, or whether the ML pipeline cannot take that many columns at the moment (which I doubt).
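If the wide table is already loaded as a DataFrame, the feature-name list can also come straight from `df.columns` rather than from a separate header RDD. The filtering step itself is plain Scala, sketched here with a hypothetical column array standing in for `df.columns`:

```scala
// Hypothetical stand-in for df.columns on a wide table,
// where "target" is the label column
val allColumns = Array("target", "f1", "f2", "f3")

// Every column except the label becomes a feature --
// this array can be passed directly to setInputCols(...)
val featureCols = allColumns.filter(_ != "target")
```

With a real DataFrame this would be `df.columns.filter(_ != "target")`, handed to `new VectorAssembler().setInputCols(featureCols)`, with no quoting or manual typing needed.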

asked Apr 11 '16 by Ruxi Zhang

1 Answer

I finally figured out one way, which is not very pretty: create a dense vector (Vectors.dense) for the features, and then create a data frame out of it.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val myDataRDDLP = inputData.map { line =>
  val indexed = line.split('\t').zipWithIndex
  // columns after position 1770 become the feature values
  val myValues = indexed.filter(x => x._2 > 1770).map(x => x._1).map(_.toDouble)
  // column 3 (shifted down by one) becomes the label
  val mykey = indexed.filter(x => x._2 == 3).map(x => x._1.toDouble - 1).mkString.toDouble
  LabeledPoint(mykey, Vectors.dense(myValues))
}
val training = sqlContext.createDataFrame(myDataRDDLP).toDF("label", "features")
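The per-line logic above can be illustrated on a single tab-separated string, using small hypothetical index cutoffs in place of the `1770` and `3` used on the real data:

```scala
// A hypothetical tab-separated input line: key column, label column, features
val line = "a\t5.0\t2.0\t3.0"

// Pair each field with its position in the line
val indexed = line.split('\t').zipWithIndex

// Fields after position 1 become the feature vector
val myValues = indexed.filter(_._2 > 1).map(_._1.toDouble)

// The field at position 1, shifted down by one, becomes the label
val myKey = indexed.filter(_._2 == 1).map(x => x._1.toDouble - 1).mkString.toDouble
```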
answered Oct 20 '22 by Ruxi Zhang