I have the following dataframe, with the first two rows looking like:
['station_id', 'country', 'temperature', 'time']
['12', 'usa', '22', '12:04:14']
I want to display the average temperature, in descending order, for the first 100 stations in 'france'.
What is the best (most efficient) way to do it in PySpark?
You can translate your query to Spark SQL (the DataFrame API) in the following way:
from pyspark.sql.functions import mean, desc

(df.filter(df["country"] == "france")                 # only French stations
   .groupBy("station_id")                             # group by station
   .agg(mean("temperature").alias("average_temp"))    # compute the average
   .orderBy(desc("average_temp"))                     # order by average, descending
   .take(100))                                        # return the first 100 rows
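Note that in your sample row the temperature is stored as a string ('22'). mean will usually coerce it, but if your schema really keeps everything as strings it is safer to cast explicitly before aggregating. A minimal sketch, assuming the column layout from your example:

from pyspark.sql.functions import col

# cast the string temperature column to a numeric type before averaging
df = df.withColumn("temperature", col("temperature").cast("double"))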
Using the RDD API and anonymous functions:
(df.rdd
   .filter(lambda x: x[1] == "france")                     # only French stations
   .map(lambda x: (x[0], float(x[2])))                     # (station_id, temperature)
   .mapValues(lambda v: (v, 1))                            # attach a count of 1
   .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))   # sum temperatures and counts
   .mapValues(lambda v: v[0] / v[1])                       # average = sum / count
   .sortBy(lambda x: x[1], ascending=False)                # sort by average, descending
   .take(100))                                             # return the first 100 pairs
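Between the two, the DataFrame version is generally the more efficient choice: it goes through Spark's Catalyst optimizer and Tungsten execution engine, while the RDD version runs plain Python lambdas for every record. If you only want to display the result rather than collect it to the driver, you can also swap .take(100) for .show(100) in the DataFrame version.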