
Spark ML VectorAssembler returns strange output

I am experiencing a very strange behaviour from VectorAssembler and I was wondering if anyone else has seen this.

My scenario is pretty straightforward. I parse data from a CSV file where I have some standard Int and Double fields and I also calculate some extra columns. My parsing function returns this:

val joinedCounts = countPerChannel ++ countPerSource // two arrays of Doubles joined
(label, orderNo, pageNo, Vectors.dense(joinedCounts))

My main function uses the parsing function like this:

val parsedData = rawData.filter(row => row != header).map(parseLine)
val data = sqlContext.createDataFrame(parsedData).toDF("label", "orderNo", "pageNo","joinedCounts")

I then use a VectorAssembler like this:

val assembler = new VectorAssembler()
  .setInputCols(Array("orderNo", "pageNo", "joinedCounts"))
  .setOutputCol("features")

val assemblerData = assembler.transform(data)

So when I print a row of my data before it goes into the VectorAssembler it looks like this:

[3.2,17.0,15.0,[0.0,0.0,0.0,0.0,3.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,2.0]]

After the transform function of VectorAssembler I print the same row of data and get this:

[3.2,(18,[0,1,6,9,14,17],[17.0,15.0,3.0,1.0,4.0,2.0])]

What on earth is going on? What has the VectorAssembler done? I've double-checked all the calculations and even followed the simple Spark examples, and I cannot see what is wrong with my code. Can you?

Dimitris asked Nov 09 '16 11:11



1 Answer

There is nothing strange about the output. Your vector has lots of zero elements, so Spark used its sparse representation.

To explain further:

Your assembled vector has 18 elements (its dimension): the two scalar columns orderNo and pageNo, followed by the 16 entries of joinedCounts.

The indices [0,1,6,9,14,17] are the positions of the non-zero elements, whose values are, in order, [17.0,15.0,3.0,1.0,4.0,2.0]. Every other position holds 0.0.
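To make the (size, indices, values) notation concrete, here is a minimal sketch in plain Scala, with no Spark dependency, that expands the triple printed in the question back into a full array. The helper expandSparse and the object name SparseDemo are hypothetical names chosen for illustration; they are not part of Spark.

```scala
object SparseDemo {
  // Hypothetical helper (not a Spark API): expand a sparse
  // (size, indices, values) triple into a plain dense array.
  def expandSparse(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
    require(indices.length == values.length, "indices and values must have the same length")
    val dense = Array.fill(size)(0.0)            // every position starts as zero
    indices.zip(values).foreach { case (i, v) => // then the non-zero entries are filled in
      dense(i) = v
    }
    dense
  }

  def main(args: Array[String]): Unit = {
    // The sparse features vector printed in the question.
    val expanded = expandSparse(18, Array(0, 1, 6, 9, 14, 17), Array(17.0, 15.0, 3.0, 1.0, 4.0, 2.0))

    // orderNo and pageNo followed by joinedCounts, as printed before the transform.
    val joinedCounts = Array(0.0, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 2.0)
    val expected = Array(17.0, 15.0) ++ joinedCounts

    println(expanded.mkString("[", ",", "]"))
    println(expanded.sameElements(expected)) // prints true
  }
}
```

The expanded array matches the input row element by element, so no data was lost or changed by the VectorAssembler; only the way the vector is printed differs.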

A sparse vector representation stores only the non-zero entries, which saves memory and can make computation easier and faster. More on sparse representation here.

Of course, you can convert that sparse representation to a dense one, but it comes at a cost: every element, zeros included, is then stored explicitly.
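If you do want the dense form, something along these lines should work. This is a sketch under assumptions: it assumes Spark 2.x or later, where VectorAssembler produces org.apache.spark.ml.linalg vectors (on Spark 1.6 the import would be org.apache.spark.mllib.linalg instead), and densify and DenseDemo are hypothetical names introduced here for illustration.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

object DenseDemo {
  // The sparse features vector from the question, rebuilt by hand.
  val sparse: Vector =
    Vectors.sparse(18, Array(0, 1, 6, 9, 14, 17), Array(17.0, 15.0, 3.0, 1.0, 4.0, 2.0))

  // toDense materialises all 18 entries, zeros included.
  val dense: Vector = sparse.toDense

  // Hypothetical helper: rewrite a vector column of a DataFrame in dense form.
  // The UDF copies each vector into a full array, which is the cost mentioned above.
  def densify(df: DataFrame, column: String): DataFrame = {
    val toDense = udf((v: Vector) => Vectors.dense(v.toArray))
    df.withColumn(column, toDense(col(column)))
  }

  def main(args: Array[String]): Unit = {
    println(sparse)
    println(dense)
    // Both forms hold the same values; only the storage differs.
    println(sparse.toArray.sameElements(dense.toArray))
  }
}
```

With the DataFrame from the question, the call would look like densify(assemblerData, "features"). Spark ML estimators generally accept either form, so converting is usually only worthwhile for display or debugging.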

If you are interested in getting feature importances, I advise you to take a look at this.

eliasah answered Sep 22 '22 18:09
