I'm getting into Spark and I have problems with Vectors: import org.apache.spark.mllib.linalg.{Vectors, Vector}
The input of my program is a text file which contains the output of an RDD[Vector]: dataset.txt:
[-0.5069793074881704,-2.368342680619545,-3.401324690974588]
[-0.7346396928543871,-2.3407983487917448,-2.793949129209909]
[-0.9174226561793709,-0.8027635530022152,-1.701699021443242]
[0.510736518683609,-2.7304268743276174,-2.418865539558031]
So, what I try to do is:
val rdd = sc.textFile("/workingdirectory/dataset")
val data = rdd.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
I get an error because it reads [0.510736518683609 as a number. Is there any way to load the vectors stored in the text file directly, without the second line? How can I delete the "[" in the map stage? I'm really new to Spark, sorry if it's a very obvious question.
Given the input, the simplest thing you can do is to use Vectors.parse:
scala> import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vectors
scala> Vectors.parse("[-0.50,-2.36,-3.40]")
res14: org.apache.spark.mllib.linalg.Vector = [-0.5,-2.36,-3.4]
It also works with the sparse representation (size, [indices], [values]):
scala> Vectors.parse("(10,[1,5],[0.5,-1.0])")
res15: org.apache.spark.mllib.linalg.Vector = (10,[1,5],[0.5,-1.0])
Combining it with your data, all you need is:
rdd.map(Vectors.parse)
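Putting it together with the loading code from the question (a minimal sketch; the path is the one used above):

import org.apache.spark.mllib.linalg.{Vector, Vectors}

val rdd = sc.textFile("/workingdirectory/dataset")
// Vectors.parse accepts the same bracketed format that Vector.toString
// produces, so no manual bracket stripping is needed.
val data = rdd.map(line => Vectors.parse(line))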
If you expect malformed / empty lines, you can wrap it using Try:
import scala.util.Try
rdd.map(line => Try(Vectors.parse(line))).filter(_.isSuccess).map(_.get)
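An equivalent formulation, assuming you're happy to drop bad lines silently, avoids the .get call by converting each Try to an Option and letting flatMap discard the failures:

import scala.util.Try

// Failed parses become None, which flatMap drops automatically.
val data = rdd.flatMap(line => Try(Vectors.parse(line)).toOption)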
Here is one way to do it:
import org.apache.spark.mllib.linalg.Vectors

val rdd = sc.textFile("/workingdirectory/dataset")
val data = rdd.map { s =>
  // Strip the surrounding brackets, then split on commas and parse the doubles.
  val vect = s.replaceAll("\\[", "").replaceAll("\\]", "").split(',').map(_.toDouble)
  Vectors.dense(vect)
}
I've just broken the map into multiple lines for readability.
Note: remember, it's simply string processing on each line.
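A slightly tidier variant (a sketch, assuming every line is wrapped in exactly one pair of brackets) uses stripPrefix/stripSuffix instead of regex replacement:

val data = rdd.map { s =>
  // stripPrefix/stripSuffix remove only a leading "[" and a trailing "]",
  // so no regex escaping is needed.
  val trimmed = s.trim.stripPrefix("[").stripSuffix("]")
  Vectors.dense(trimmed.split(',').map(_.toDouble))
}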