I have a dataframe that contains a thousands of rows, what I'm looking for is to group by and count a column and then order by the out put: what I did is somthing looks like :
import org.apache.spark.sql.hive.HiveContext
import sqlContext.implicits._
val objHive = new HiveContext(sc)
val df = objHive.sql("select * from db.tb")
val df_count=df.groupBy("id").count().collect()
df_count.sort($"count".asc).show()
The short answer is Yes, the hourly counts will maintain the same order.
In Spark, the Count function returns the number of elements present in the dataset.
public class RelationalGroupedDataset extends Object. A set of methods for aggregations on a DataFrame , created by groupBy , cube or rollup (and also pivot ). The main method is the agg function, which has multiple variants. This class also contains some first-order statistics such as mean , sum for convenience.
1 Answer. Suppose you have a df that includes columns “name” and “age”, and on these two columns you want to perform groupBY. Now, in order to get other columns also after doing a groupBy you can use join function. Now, data_joined will have all columns including the count values.
You can use sort
or orderBy
as below
val df_count = df.groupBy("id").count()
df_count.sort(desc("count")).show(false)
df_count.orderBy($"count".desc).show(false)
Don't use collect()
since it brings the data to the driver as an Array
.
Hope this helps!
//import the SparkSession which is the entry point for spark underlying API to access
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val pathOfFile="f:/alarms_files/"
//create session and hold it in spark variable
val spark=SparkSession.builder().appName("myApp").getOrCreate()
//read the file below API will return DataFrame of Row
var df=spark.read.format("csv").option("header","true").option("delimiter", "\t").load("file://"+pathOfFile+"db.tab")
//groupBY id column and take count of the column and order it by count of the column
df=df.groupBy(df("id")).agg(count("*").as("columnCount")).orderBy("columnCount")
//for projecting the dataFrame it will show only top 20 records
df.show
//for projecting more than 20 records eg:
df.show(50)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With