I am trying to improve my Spark Scala skills, and I have a case that I cannot figure out how to handle, so please advise!
I have the original data as shown in the figure below:
I want to calculate the percentage that each value in the count column represents. E.g. the last error value is 64; how much is 64 as a percentage of the sum of all the values in the column? Please note that I am reading the original data as DataFrames using sqlContext. Here is my code:
val df1 = df.groupBy(" Code")
  .agg(sum("count").alias("sum"), mean("count")
    .multiply(100)
    .cast("integer").alias("percentage"))
I want results similar to this:
Thanks in advance!
Use agg and window functions:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

df
  .groupBy("code")
  .agg(sum("count").alias("count"))
  // over() with no window spec spans every row of the grouped result,
  // so sum("count").over() is the grand total
  .withColumn("fraction", col("count") / sum("count").over())