Assuming I have an RDD containing (Int, Int) tuples, I wish to turn it into a Vector where the first Int in the tuple is the index and the second is the value.
Any idea how I can do that?
I've updated my question and added my solution to clarify: my RDD is already reduced by key, and the number of keys is known. I want a vector in order to update a single accumulator instead of multiple accumulators.
My final solution was therefore:
reducedStream.foreachRDD { rdd =>
  // foreach is an action, so the accumulator update actually runs;
  // collect(PartialFunction) is a lazy transformation and would do nothing here.
  rdd.foreach { case (x: Int, y: Int) =>
    val v = Array.fill(4)(0.0) // one slot per known key
    v(x) = y.toDouble
    accumulator += new Vector(v)
  }
}
This uses the Vector class from the accumulator example in the Spark documentation.
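For completeness, the accumulator itself can be declared as in that documentation example. This is a sketch assuming a Vector class with zeros, size, and += as shown there; in Spark 2.x you would use AccumulatorV2 instead:

object VectorAccumulatorParam extends AccumulatorParam[Vector] {
  def zero(initialValue: Vector): Vector = Vector.zeros(initialValue.size) // zero vector of the same size
  def addInPlace(v1: Vector, v2: Vector): Vector = v1 += v2                // element-wise merge
}

val accumulator = sc.accumulator(new Vector(Array.fill(4)(0.0)))(VectorAccumulatorParam)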
One way to pair elements by a reversed index is to:
1. Apply zipWithIndex to the RDD, as already suggested.
2. Sort it in reversed order and zip the resulting RDD with an index as well.
3. reduceByKey or groupByKey the union of the RDDs from steps 1 and 2, with the index as the key.
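A minimal sketch of those three steps (names are illustrative; grouping on the shared index pairs each element with its reverse-order counterpart):

val forward = rdd.zipWithIndex().map { case (elem, i) => (i, elem) }  // step 1
val backward = rdd.sortBy(identity, ascending = false)                // step 2
  .zipWithIndex().map { case (elem, i) => (i, elem) }
val byIndex = forward.union(backward).groupByKey()                    // step 3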
This should be possible by first indexing the RDD. The transformation zipWithIndex provides stable indexing, numbering each element in its original order. If you expect to use lookup often on the same RDD, I'd recommend caching the indexKey RDD to improve performance.
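A minimal sketch of that idea (indexKey as named above; the lookup key is arbitrary):

val indexKey = rdd.zipWithIndex().map { case (elem, i) => (i, elem) }.cache() // key by stable index, cache for reuse
val elemsAt3 = indexKey.lookup(3L)                                            // Seq of elements at index 3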
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
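For example (the data and path here are placeholders):

val fromCollection = sc.parallelize(Seq((0, 10), (1, 20), (2, 30))) // parallelize a driver-side collection
val fromStorage = sc.textFile("hdfs:///path/to/data.txt")           // reference an external dataset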
Transformations create new RDDs from existing ones; an action is performed when we want to work with the actual dataset. Unlike a transformation, triggering an action does not form a new RDD. Actions, then, are the Spark RDD operations that return non-RDD values.
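To illustrate the distinction:

val doubled = rdd.mapValues(_ * 2) // transformation: lazy, just describes a new RDD
val n = doubled.count()            // action: runs the job and returns a Long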
val n = 4 // the number of keys is known, so pre-size the Vector ('updated' past the end would throw)
rdd.collectAsMap().foldLeft(Vector.fill(n)(0)) { case (acc, (k, v)) => acc.updated(k, v) }
Turn the RDD into a Map, then fold over it, filling in a pre-sized Vector as we go.
You could just use collect(), but if there are many repetitions of tuples with the same key, that might not fit in memory.
One key question: do you really need a Vector? A Map could be much more suitable.
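For instance, since the RDD is already reduced by key, it collects straight into a map:

val asMap: scala.collection.Map[Int, Int] = rdd.collectAsMap() // one entry per key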
If you really need a local Vector, you first have to use .collect() and then do local transformations into the Vector. Of course, you must have enough memory for this. But the real problem here is finding a Vector that can be built efficiently from (index, value) pairs. As far as I know, Spark MLlib has the class org.apache.spark.mllib.linalg.Vectors, which can create a Vector from an array of indices and values, and even from tuples. Under the hood it uses breeze.linalg, so that is probably the best starting point for you.
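A minimal sketch using that MLlib factory (the size of 4 matches the known number of keys above):

import org.apache.spark.mllib.linalg.Vectors

val pairs = rdd.collect().toSeq.map { case (i, v) => (i, v.toDouble) } // (index, value) pairs, locally
val vec = Vectors.sparse(4, pairs)                                     // sparse vector of the given size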
If you just need ordering, you can simply use .sortByKey(), as you already have an RDD[(K,V)]. This way you get an ordered stream, which does not strictly follow your intention but might suit you even better. You can then drop extra elements with the same key via .reduceByKey(), producing only the resulting elements.
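A minimal sketch of that combination (the merge function here arbitrarily keeps the last value seen per key):

val ordered = rdd.reduceByKey((a, b) => b) // collapse duplicate keys
  .sortByKey()                             // then order by key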
Finally, if you really need a large vector, do .sortByKey and then produce the real vector with a .flatMap() that maintains a counter: it drops all but one element for the same index and inserts the needed number of 'default' elements for missing indices.
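A hedged sketch of that flatMap-with-counter step, done locally after collecting the sorted pairs (the stateful counter only works on a local collection, and the default value is an assumption of this sketch):

val default = 0
val sorted = rdd.sortByKey().collect() // Array[(Int, Int)], ordered by index
var next = 0                           // the next index the output vector expects
val vec = sorted.toVector.flatMap { case (i, v) =>
  if (i < next) Vector.empty                         // duplicate index: drop
  else {
    val padded = Vector.fill(i - next)(default) :+ v // pad missing indices
    next = i + 1
    padded
  }
}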
Hope this is clear enough.