Spark ML VectorAssembler returns strange output

Question

I am experiencing a very strange behaviour from VectorAssembler and I was wondering if anyone else has seen this.

My scenario is pretty straightforward. I parse data from a CSV file where I have some standard Int and Double fields and I also calculate some extra columns. My parsing function returns this:

val joinedCounts = countPerChannel ++ countPerSource // two arrays of Doubles joined
(label, orderNo, pageNo, Vectors.dense(joinedCounts))

My main function uses the parsing function like this:

val parsedData = rawData.filter(row => row != header).map(parseLine)
val data = sqlContext.createDataFrame(parsedData).toDF("label", "orderNo", "pageNo","joinedCounts")

I then use a VectorAssembler like this:

val assembler = new VectorAssembler()
  .setInputCols(Array("orderNo", "pageNo", "joinedCounts"))
  .setOutputCol("features")

val assemblerData = assembler.transform(data)

So when I print a row of my data before it goes into the VectorAssembler it looks like this:

[3.2,17.0,15.0,[0.0,0.0,0.0,0.0,3.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,2.0]]

After the transform function of VectorAssembler I print the same row of data and get this:

[3.2,(18,[0,1,6,9,14,17],[17.0,15.0,3.0,1.0,4.0,2.0])]

What on earth is going on? What has the VectorAssembler done? I've double-checked all the calculations and even followed the simple Spark examples, and I cannot see what is wrong with my code. Can you?

eliasah · Accepted Answer

There is nothing strange about the output. Your vector has a lot of zero elements, so Spark used its sparse representation.

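To explain further:

The first number, 18, is the size (dimension) of the vector: orderNo, pageNo, and the 16 values of joinedCounts.

The indices [0,1,6,9,14,17] are the positions of the non-zero elements, and [17.0,15.0,3.0,1.0,4.0,2.0] are the values at those positions, in the same order. Every other position holds 0.0.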
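To check this against the row in the question, here is a minimal sketch that rebuilds both forms by hand. It assumes Spark 2.x or later, where vectors live in org.apache.spark.ml.linalg; the sqlContext in the question suggests Spark 1.x, where the same calls are in org.apache.spark.mllib.linalg instead. The object name SparseVsDense is only for illustration.

```scala
import org.apache.spark.ml.linalg.Vectors

object SparseVsDense {
  // Values copied from the row printed in the question.
  val orderNo = 17.0
  val pageNo = 15.0
  val joinedCounts = Array(0.0, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0, 1.0,
                           0.0, 0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 2.0)

  // Concatenate by hand, in the order passed to setInputCols.
  val dense = Vectors.dense(Array(orderNo, pageNo) ++ joinedCounts)

  // The sparse vector exactly as printed after transform: (size, indices, values).
  val sparse = Vectors.sparse(18,
    Array(0, 1, 6, 9, 14, 17),
    Array(17.0, 15.0, 3.0, 1.0, 4.0, 2.0))

  def main(args: Array[String]): Unit = {
    println(dense.size)                                  // 18
    println(sparse.toArray.sameElements(dense.toArray))  // true
    println(dense.toSparse.indices.mkString(","))        // 0,1,6,9,14,17
  }
}
```

The two vectors hold the same 18 values. The indices of joinedCounts are shifted by 2 because orderNo and pageNo occupy positions 0 and 1.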

A sparse vector stores only the non-zero entries, which saves memory and can make computation faster. More on sparse representation here.

Now of course you can convert that sparse representation to a dense one, but it comes at a cost: every zero has to be stored again.
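If you do want dense vectors in the DataFrame, one way is a small UDF applied to the features column. This is only a sketch under assumptions: it uses the Spark 2.x+ SparkSession and org.apache.spark.ml.linalg API (on Spark 1.x you would use sqlContext and org.apache.spark.mllib.linalg), and the names DensifyFeatures and densify are made up for the example. The one-row DataFrame stands in for the assemblerData from the question.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object DensifyFeatures {
  // Materialise every entry, zeros included, as a dense vector.
  def densify(v: Vector): Vector = Vectors.dense(v.toArray)

  def main(args: Array[String]): Unit = {
    // Local session purely for the example.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("densify")
      .getOrCreate()

    // Stand-in for assemblerData: the label and the sparse features from the question.
    val assembled = spark.createDataFrame(Seq(
      (3.2, Vectors.sparse(18,
        Array(0, 1, 6, 9, 14, 17),
        Array(17.0, 15.0, 3.0, 1.0, 4.0, 2.0)))
    )).toDF("label", "features")

    // Wrap the helper as a UDF and overwrite the features column.
    val toDense = udf(densify _)
    assembled.withColumn("features", toDense(col("features"))).show(false)

    spark.stop()
  }
}
```

Here each row goes from 6 indices plus 6 values to 18 stored doubles, so this is usually worth doing only for display or debugging.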

If you are interested in getting feature importances, I advise you to take a look at this.

Tags: scala, apache-spark, apache-spark-ml, apache-spark-mllib

Dimitris
