Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

calculating first quartile for a numeric column in spark

I'm a newbie in spark/scala. This is what I'm doing to calculate the first quartile of a csv file

val column= sc.textFile("test.txt").map(_.split(",")(2)).flatMap(_.split(",")).map((_.toDouble))
val total = column.count.toDouble
val upper=(total+1)/4
val upper2= scala.math.ceil(upper).toInt

I'm not really sure how to sort the column other than adding a Key Value Pair. all I need is to take the last 2 values for the quartiles, after they are sorted. But i'm forced to create a key value pair.

val quartiles = column.map((_,1)).sortByKey(true).take(upper2)
val first_quartile =0
if(upper % upper.toInt >0){
   first_quartile = quartiles(upper.toInt-1) 
}else{
   first_quartile = (quartiles(upper2-1) +(quartiles(upper2-2))/2
}

This works, but it will leave me with an annoying key value pair. how do i revert back to just 1 column, instead of 2 (e.g. the key value pair)

like image 532
user2773013 Avatar asked Jun 23 '14 23:06

user2773013


People also ask

How do you find the first quartile of a numerical data set?

Example of Quartile First, mark down the median, Q2, which in this case is the 10th value: 75. Q1 is the central point between the smallest score and the median. In this case, Q1 falls between the first and fifth score: 68. (Note that the median can also be included when calculating Q1 or Q3 for an odd set of values.


1 Answers

Was just doing this myself. I started out writing a function to compute the median, but found it was faster and easier to get quantiles by casting my RDD as a DataFrame and querying it with SQL. Here's a self-contained example:

  // construct example RDD
  val rows = Seq(3, 1, 5, 1, 9, 2, 2)
  val rdd = sc.parallelize(rows)

  // construct Dataframe
  case class MedianDF(value: Long)
  val df = rdd.map(row => MedianDF(row.toLong)).toDF 

  // register the table and then query for your desired percentile
  df.registerTempTable("table")
  sqlContext.sql("SELECT PERCENTILE(value, 0.5) FROM table").show()

Which returns 2, the median. Similarly, if you want the first quartile just pass 0.25 to PERCENTILE:

sqlContext.sql("SELECT PERCENTILE(value, 0.25) FROM table").show()
like image 101
Erin Shellman Avatar answered Sep 29 '22 01:09

Erin Shellman