 

What is the most efficient way in PySpark to reduce a DataFrame?

I have the following DataFrame, with the first two rows looking like this:

['station_id', 'country', 'temperature', 'time']
['12', 'usa', '22', '12:04:14']

I want to display the average temperature, in descending order, for the top 100 stations in 'france'.

What is the best (most efficient) way to do this in PySpark?

asked Dec 17 '16 by mel

1 Answer

We can translate your query to Spark SQL (via the DataFrame API) as follows:

from pyspark.sql.functions import mean, desc

(df
  .filter(df["country"] == "france")                  # only French stations
  .groupBy("station_id")                              # group by station
  .agg(mean("temperature").alias("average_temp"))     # calculate the average
  .orderBy(desc("average_temp"))                      # order by average, descending
  .take(100))                                         # return the first 100 rows
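Since these DataFrame calls are just the programmatic form of Spark SQL, the same query can also be written as literal SQL against a temporary view. Below is a minimal sketch; the view name "stations" and the SparkSession variable `spark` are assumptions for illustration, not part of the question.

# Minimal sketch: the same query as literal Spark SQL.
# Assumes `df` has the columns from the question and a SparkSession `spark` exists.
df.createOrReplaceTempView("stations")

top_100 = spark.sql("""
    SELECT station_id, AVG(temperature) AS average_temp
    FROM stations
    WHERE country = 'france'
    GROUP BY station_id
    ORDER BY average_temp DESC
    LIMIT 100
""").collect()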

Using the RDD API and anonymous functions:

(df.rdd
  .filter(lambda x: x[1] == "france")                       # only French stations
  .map(lambda x: (x[0], float(x[2])))                       # select station & temp (cast string temp to float)
  .mapValues(lambda x: (x, 1))                              # generate count
  .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))     # calculate sum & count
  .mapValues(lambda x: x[0] / x[1])                         # calculate average
  .sortBy(lambda x: x[1], ascending=False)                  # sort by average, descending
  .take(100))                                               # return the first 100 rows
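Of the two, the DataFrame version is usually the more efficient one: it goes through the Catalyst optimizer, whereas the lambdas in the RDD version are opaque to Spark. If you want to try either version end to end, a small DataFrame matching the layout in the question can be built like the sketch below; the extra sample rows and the session name are assumptions for illustration.

# Minimal sketch: build a toy DataFrame with the question's columns.
# The rows beyond the one shown in the question are made up for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("station-temps").getOrCreate()

rows = [("12", "usa", "22", "12:04:14"),
        ("13", "france", "19", "12:04:15"),
        ("14", "france", "23", "12:04:16")]
df = spark.createDataFrame(rows, ["station_id", "country", "temperature", "time"])

# temperature arrives as a string in the sample, so cast it before averaging
df = df.withColumn("temperature", df["temperature"].cast("double"))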
answered Nov 05 '22 by mtoto