Say I have a collection:
List(1, 3,-1, 0, 2, -4, 6)
It's easy to sort it:
List(-4, -1, 0, 1, 2, 3, 6)
Then I can construct a new collection by computing 6 - 3, 3 - 2, 2 - 1, 1 - 0, and so on, like this:
for (i <- 0 to list.length - 2) yield {
  list(i + 1) - list(i)
}
and get a vector:
Vector(3, 1, 1, 1, 1, 3)
That is, I want to subtract each element from the one that follows it.
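For what it's worth, the same local result can be obtained with the standard library's sliding (a sketch, assuming list is the sorted List above):

list.sliding(2).map { case Seq(a, b) => b - a }.toVector
// Vector(3, 1, 1, 1, 1, 3)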
But how can I implement this with an RDD in Spark?
I know that for the collection:
List(-4, -1, 0, 1, 2, 3, 6)
There will be several partitions of the collection, each of which is ordered. Can I do a similar operation within each partition and then combine the results from all the partitions?
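To illustrate the concern, here is a sketch with plain Scala Seqs standing in for two hypothetical partitions (the names and split are mine):

val part0 = Seq(-4, -1, 0)
val part1 = Seq(1, 2, 3, 6)
// diffs computed inside each partition independently
val perPartition = Seq(part0, part1).flatMap(_.sliding(2).map { case Seq(a, b) => b - a })
// Seq(3, 1, 1, 1, 3) -- the cross-boundary diff 1 - 0 is missing,
// so partition edges would need special handling.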
The most efficient solution is to use the sliding method from spark-mllib:

import org.apache.spark.mllib.rdd.RDDFunctions._

val rdd = sc.parallelize(Seq(1, 3, -1, 0, 2, -4, 6))
  .sortBy(identity)                    // sort the values globally
  .sliding(2)                          // RDD of Array(current, next)
  .map { case Array(x, y) => y - x }   // next minus current
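Collecting gives the same result as the local version, since sortBy produces a globally ordered RDD and sliding takes care of windows that span partition boundaries:

rdd.collect()
// Array(3, 1, 1, 1, 1, 3)

If you'd rather avoid the spark-mllib dependency, a rough alternative is to key each element by its position with zipWithIndex and join it with a shifted copy of itself (a sketch; variable names are my own):

val sorted = sc.parallelize(Seq(1, 3, -1, 0, 2, -4, 6))
  .sortBy(identity)
  .zipWithIndex()                                   // (value, index)
  .map { case (v, i) => (i, v) }                    // key by position
val diffs = sorted
  .join(sorted.map { case (i, v) => (i - 1, v) })   // (i, (value(i), value(i + 1)))
  .map { case (i, (cur, next)) => (i, next - cur) }
  .sortByKey()
  .values
// diffs.collect() == Array(3, 1, 1, 1, 1, 3)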