
Apache Spark Moving Average

I have a huge file in HDFS having Time Series data points (Yahoo Stock prices).

I want to find the moving average of the time series. How do I go about writing an Apache Spark job to do that?

asked May 01 '14 by Ahmed Shabib

People also ask

How do you find the average of a column in spark DataFrame?

Method 1: using the select() method. If we want to return the average value from multiple columns, we have to use the avg() method inside the select() method, specifying the column names separated by commas. Here, df is the input PySpark DataFrame and column_name is the column to get the average value from.
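That snippet describes the PySpark API; a minimal sketch of the same pattern in Spark's Scala API, assuming a DataFrame df with a numeric column named price (both names are hypothetical):

import org.apache.spark.sql.functions.avg

// average of a single column; add more avg(...) calls to the select
// to average several columns at once
df.select(avg("price")).show()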

What is the difference between rolling and moving average?

A rolling average, sometimes referred to as a moving average, is a metric that calculates trends over short periods of time using a set of data. Specifically, it helps calculate trends when they might otherwise be difficult to detect.

What is a moving average in R?

A moving average is a smoothing approach that averages values from a window of consecutive time periods, thereby generating a series of averages. Moving average approaches differ primarily in the number of values averaged, how the average is computed, and how many times averaging is performed.

What is moving average in data mining?

A moving average can be as simple as a sequence of arithmetic averages for the values in a time series. In fact, this is the definition of a simple moving average, which is the focus of this tip. Simple arithmetic averages are computed over a window with a fixed number of periods.
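As a concrete illustration of that definition, in plain Scala (no Spark needed): a window of size 3 sliding over five values yields three averages.

val values = Seq(2.0, 4.0, 6.0, 8.0, 10.0)
val sma = values.sliding(3).map(w => w.sum / w.size).toList
// sma == List(4.0, 6.0, 8.0)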

Is it possible to do a moving average in spark?

Moving average is a tricky problem for Spark, and any distributed system. When the data is spread across multiple machines, there will be some time windows that cross partitions. I think the key is duplicating data points at the start and end of partitions. I will try to think of a way to do this in Spark. – Daniel Darabos, May 1, 2014 at 12:54

How to get the moving average of the last n days?

If you have the time series of market prices of a share, you can easily compute the moving average of the last n days. In contrast to GROUP BY clauses, where only one output row exists per group, with window functions all rows of the result set retain their identity and are shown.
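In Spark that idea maps onto DataFrame window functions (added in Spark 1.4, so after this question was asked). A hedged sketch, assuming a DataFrame df with columns named date and price (hypothetical names) and a window of n rows:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col}

val n = 3
// the current row plus the n - 1 preceding rows, ordered by date
val w = Window.orderBy("date").rowsBetween(-(n - 1), 0)
val withMovingAvg = df.withColumn("moving_avg", avg(col("price")).over(w))

Note that without a partitionBy clause this pulls all rows into a single partition, which is exactly the scaling problem the answers below work around.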

How to add a value to a list in spark?

Keep adding values to a list inside the Map() method; once the window size is reached, compute the average and call context.write(). On the next Map() call, add the new value to the list, delete the oldest value, recompute the average, and call context.write() again. Spark, however, does not give you that kind of control over accumulating values within a task and managing their count.
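A minimal sketch of the per-record accumulation logic described there, in plain Scala (the buffer and the update helper are hypothetical names standing in for the mapper's list):

import scala.collection.mutable

val window = 3
val buffer = mutable.Queue.empty[Double]

// mimics one Map() call: push the new value, evict the oldest,
// and emit an average once the buffer holds a full window
def update(value: Double): Option[Double] = {
  buffer.enqueue(value)
  if (buffer.size > window) buffer.dequeue()
  if (buffer.size == window) Some(buffer.sum / window) else None
}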


2 Answers

You can use the sliding function from MLlib, which probably does the same thing as Daniel's answer. Note that you will have to sort the data by time before using sliding.

import org.apache.spark.mllib.rdd.RDDFunctions._

sc.parallelize(1 to 100, 10)
  .sliding(3)
  .map(curSlice => (curSlice.sum / curSlice.size))
  .collect()
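For real time-series data you would sort first, as noted above. A hedged sketch with hypothetical (timestamp, price) pairs:

import org.apache.spark.mllib.rdd.RDDFunctions._

// hypothetical (timestamp, price) pairs, deliberately out of order
val prices = sc.parallelize(Seq((3L, 11.0), (1L, 10.0), (2L, 12.0), (4L, 13.0)))

prices.sortByKey()        // order by time
  .map(_._2)              // keep just the prices
  .sliding(3)
  .map(w => w.sum / w.size)
  .collect()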
answered Oct 23 '22 by Arvind

Moving average is a tricky problem for Spark, and any distributed system. When the data is spread across multiple machines, there will be some time windows that cross partitions. We have to duplicate the data at the start of the partitions, so that calculating the moving average per partition gives complete coverage.

Here is a way to do this in Spark. The example data:

val ts = sc.parallelize(0 to 100, 10)
val window = 3

A simple partitioner that puts each row in the partition we specify by the key:

class StraightPartitioner(p: Int) extends org.apache.spark.Partitioner {
  def numPartitions = p
  def getPartition(key: Any) = key.asInstanceOf[Int]
}

Create the data with the first window - 1 rows of each partition copied to the previous partition:

val partitioned = ts.mapPartitionsWithIndex((i, p) => {
  // the first window - 1 rows of this partition
  val overlap = p.take(window - 1).toArray
  // send a copy of them to the previous partition...
  val spill = overlap.iterator.map((i - 1, _))
  // ...and also keep them here, ahead of the remaining rows
  val keep = (overlap.iterator ++ p).map((i, _))
  if (i == 0) keep else keep ++ spill
}).partitionBy(new StraightPartitioner(ts.partitions.length)).values

Just calculate the moving average on each partition:

val movingAverage = partitioned.mapPartitions(p => {
  val sorted = p.toSeq.sorted
  val olds = sorted.iterator
  val news = sorted.iterator
  // prime the running total with the first window - 1 values
  var sum = news.take(window - 1).sum
  (olds zip news).map({ case (o, n) => {
    sum += n   // add the newest value
    val v = sum // the full window sum (not yet divided by window)
    sum -= o   // drop the oldest value
    v
  }})
})

Because of the duplicated segments this will have no gaps in coverage. Note that the values emitted are window sums: for windows of size 3 over 0 to 100 they are 3, 6, ..., 297, which is what the check below verifies.

scala> movingAverage.collect.sameElements(3 to 297 by 3)
res0: Boolean = true
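If you want the averages themselves, one extra map (a small addition, not part of the original answer) divides each window sum by the window size:

val averages = movingAverage.map(_.toDouble / window)
// yields 1.0, 2.0, ..., 99.0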
answered Oct 23 '22 by Daniel Darabos