 

Spark: count the percentage of a column's values

I am trying to improve my Spark Scala skills and I have a case I cannot find a way to handle, so please advise!

I have the original data shown in the figure below:

[figure: the original data]

I want to calculate the percentage of each value in the count column. For example, the last error value is 64: what percentage is 64 of the sum of all the values in that column? Note that I am reading the original data as a DataFrame using sqlContext. Here is my code:

    val df1 = df.groupBy("Code")
      .agg(sum("count").alias("sum"),
           mean("count").multiply(100).cast("integer").alias("percentage"))

I want results similar to this:

[figure: the expected result with a percentage column]

Thanks in advance!

asked Oct 21 '17 by Foaad Mohamad Haddod


People also ask

How do you find a percentage of a column in PySpark?

The sum() function over a window partitioned by the grouping column (partitionBy()) is used to calculate the percentage of a column by group: sum the price column over the partition, divide each row's price by that group total, and name the result price_percent. A sketch follows below.
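
A minimal Scala sketch of that pattern (the column names category and price, the sample rows, and the SparkSession named spark are assumptions for illustration):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // assumes an existing SparkSession named `spark`
    import spark.implicits._

    // hypothetical data: one row per item, with a grouping column and a price
    val sales = Seq(("A", 10.0), ("A", 30.0), ("B", 60.0)).toDF("category", "price")

    // each row's share of its group's total, named price_percent as described above
    val byCategory = Window.partitionBy("category")
    val withPercent = sales.withColumn(
      "price_percent",
      col("price") / sum("price").over(byCategory) * 100)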

How do you aggregate in Spark?

You need to define a key or grouping for the aggregation, plus an aggregation function that specifies how the values in each column are combined. Given multiple input values, the aggregation function produces one result per group, as sketched below.
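
A short Scala illustration, reusing the Code and count columns from the question above (df is the assumed input DataFrame):

    import org.apache.spark.sql.functions._

    // one grouping key, two aggregation functions: each yields one value per group
    val totals = df
      .groupBy("Code")
      .agg(sum("count").alias("total"),
           avg("count").alias("avg_count"))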

How do I count the number of columns in a Spark DataFrame?

To get the number of columns in a PySpark DataFrame, apply len() to DataFrame.columns.
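
The prose above describes the PySpark call; the Scala equivalent, for an assumed DataFrame df, is a one-liner:

    // df.columns returns Array[String]; its length is the column count
    val numColumns = df.columns.length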

What is withColumn in PySpark?

In PySpark, withColumn() is a widely used DataFrame transformation that changes a column's values, converts the data type of an existing column, or creates a new column.
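
A brief Scala sketch of both uses, assuming a DataFrame df with a numeric count column:

    import org.apache.spark.sql.functions._

    val df2 = df
      .withColumn("count_x100", col("count") * 100)       // create a new column
      .withColumn("count", col("count").cast("integer"))  // convert an existing column's type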


1 Answer

Use agg and window functions:

    import org.apache.spark.sql.expressions._
    import org.apache.spark.sql.functions._

    df
      .groupBy("code")
      .agg(sum("count").alias("count"))                           // total per code
      .withColumn("fraction", col("count") / sum("count").over()) // each code's share of the grand total
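
If you want an explicit percentage rather than a fraction, a small variation of the same idea (a sketch keeping the answer's column names, not part of the original answer) scales by 100 and rounds:

    df
      .groupBy("code")
      .agg(sum("count").alias("count"))
      .withColumn("percentage",
        round(col("count") / sum("count").over() * 100, 2))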
answered Nov 16 '22 by user8811088