
Dataframe: how to groupBy/count then order by count in Scala

I have a dataframe that contains thousands of rows. What I'm looking for is to group by and count a column, and then order by the output. What I did looks something like this:

import org.apache.spark.sql.hive.HiveContext
import sqlContext.implicits._


val objHive = new HiveContext(sc)
val df = objHive.sql("select * from db.tb")
val df_count=df.groupBy("id").count().collect()
df_count.sort($"count".asc).show()
asked Aug 07 '18 by HISI

People also ask

Does groupBy preserve order PySpark?

The short answer is yes: the hourly counts will maintain the same order.
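That said, if ordering matters, an explicit sort after the aggregation removes any doubt. A minimal self-contained sketch (the hour/event column names are hypothetical):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

val spark = SparkSession.builder().appName("orderDemo").master("local[*]").getOrCreate()
import spark.implicits._

// toy data: one row per event, keyed by hour
val events = Seq((1, "a"), (1, "b"), (2, "c")).toDF("hour", "event")

// aggregate, then sort explicitly so the output order is guaranteed
val hourly = events.groupBy("hour").agg(count("*").as("cnt"))
hourly.orderBy("hour").show()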

What does count () do in Spark?

In Spark, count() returns the number of elements present in the dataset.
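For illustration, a minimal sketch (local session assumed): count() is an action, so it triggers a job and returns the total as a Long.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("countDemo").master("local[*]").getOrCreate()

// range(1000) builds a Dataset of 1000 rows; count() returns 1000L
val n: Long = spark.range(1000).count()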

What is RelationalGroupedDataset?

public class RelationalGroupedDataset extends Object. A set of methods for aggregations on a DataFrame, created by groupBy, cube or rollup (and also pivot). The main method is the agg function, which has multiple variants. This class also contains some first-order statistics such as mean and sum for convenience.
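A small sketch of how it is typically used (column names hypothetical): groupBy produces the RelationalGroupedDataset, and agg turns it back into a DataFrame.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, mean, sum}

val spark = SparkSession.builder().appName("aggDemo").master("local[*]").getOrCreate()
import spark.implicits._

val sales = Seq(("a", 10), ("a", 20), ("b", 5)).toDF("id", "amount")

// groupBy returns a RelationalGroupedDataset; agg yields a DataFrame again
sales.groupBy("id").agg(count("*").as("rows"), sum("amount"), mean("amount")).show()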

How do I get other columns with Spark DataFrame groupBy?

Suppose you have a df that includes the columns "name" and "age", and you want to group by these two columns. To get the other columns back after the groupBy, you can use a join: aggregate first, then join the result to the original df. The resulting data_joined will then have all columns, including the count values.
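A minimal sketch of that join-back pattern (the people data is hypothetical):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

val spark = SparkSession.builder().appName("joinDemo").master("local[*]").getOrCreate()
import spark.implicits._

val people = Seq(("tom", 30, "NY"), ("ann", 25, "LA"), ("tom", 30, "SF")).toDF("name", "age", "city")

// aggregate on the grouping keys, then join back to recover the other columns
val counts = people.groupBy("name", "age").agg(count("*").as("count"))
val data_joined = people.join(counts, Seq("name", "age"))
data_joined.show() // all original columns plus the per-group count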


2 Answers

You can use sort or orderBy as below

// desc comes from the functions package; the $-column syntax needs the implicits
import org.apache.spark.sql.functions.desc
import objHive.implicits._ // objHive is the HiveContext from the question

val df_count = df.groupBy("id").count()

df_count.sort(desc("count")).show(false)

df_count.orderBy($"count".desc).show(false)

Don't use collect() here, since it brings all the data to the driver as an Array.
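If you do need rows on the driver, pull only a bounded number of them; a small sketch reusing the df_count above (the limit of 10 is arbitrary):

// take(10) returns at most 10 Rows to the driver instead of the whole result
val top10 = df_count.orderBy($"count".desc).take(10)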

Hope this helps!

answered Oct 27 '22 by koiralo


// SparkSession is the entry point to the Spark SQL API
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val pathOfFile = "f:/alarms_files/"

// create the session and hold it in the spark variable
val spark = SparkSession.builder().appName("myApp").getOrCreate()

// read the file; this API returns a DataFrame of Rows
var df = spark.read.format("csv").option("header", "true").option("delimiter", "\t").load("file://" + pathOfFile + "db.tab")

// group by the id column, count the rows per group, and order by that count
df = df.groupBy(df("id")).agg(count("*").as("columnCount")).orderBy("columnCount")

// show displays only the top 20 records
df.show()

// to display more records, e.g.:
df.show(50)
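If the largest counts should come first instead, a small variation on the same columns (desc comes from the functions._ import above):

// sort descending instead of the ascending default
df.orderBy(desc("columnCount")).show(50)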
answered Oct 27 '22 by Gagan Sp