Something similar to Spark - Group by Key then Count by Value would allow me to emulate the functionality of Pandas' df.series.value_counts() in Spark:
The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default. (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html)
I am curious whether this can be achieved more nicely / simply for data frames in Spark.
By default, value_counts sorts the data by count in descending order. The ascending parameter lets you change this: when you set ascending=True, value_counts sorts the counts from low to high (i.e., ascending order).
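On the Spark side, the equivalent of ascending=True is simply flipping the sort direction. A minimal sketch, assuming the same df and spark.implicits._ as in the answer below:
df.groupBy($"value").count.orderBy($"count".asc)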
count() should be used when you want the number of valid (non-NA) values in a column, while value_counts() should be used to find the frequency of each value in a Series.
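In Spark terms the same distinction looks roughly like this (a sketch, assuming spark.implicits._ and org.apache.spark.sql.functions.count are imported and df has a "value" column):
// number of non-null entries in one column, roughly pandas count()
df.select(count($"value")).show()
// frequency of each distinct value, roughly pandas value_counts()
df.groupBy($"value").count.orderBy($"count".desc).show()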
It is just a basic aggregation, isn't it?
df.groupBy($"value").count.orderBy($"count".desc)
Pandas:
import pandas as pd
pd.Series([1, 2, 2, 2, 3, 3, 4]).value_counts()
2 3
3 2
4 1
1 1
dtype: int64
Spark SQL:
Seq(1, 2, 2, 2, 3, 3, 4).toDF("value")
.groupBy($"value").count.orderBy($"count".desc)
+-----+-----+
|value|count|
+-----+-----+
| 2| 3|
| 3| 2|
| 1| 1|
| 4| 1|
+-----+-----+
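If this pattern comes up often, it can be wrapped in a small helper. valueCounts is a hypothetical name, not a built-in Spark method; this is only a sketch:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.desc

// Emulates pandas value_counts for a single column: group, count, sort by frequency.
def valueCounts(df: DataFrame, colName: String): DataFrame =
  df.groupBy(colName).count.orderBy(desc("count"))

// e.g. valueCounts(Seq(1, 2, 2, 2, 3, 3, 4).toDF("value"), "value").show()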
If you want to include additional grouping columns (like "key"), just put these in the groupBy:
df.groupBy($"key", $"value").count.orderBy($"count".desc)
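For example, with a hypothetical two-column DataFrame (assuming spark.implicits._ is in scope):
val kv = Seq(("a", 1), ("a", 1), ("b", 2)).toDF("key", "value")
kv.groupBy($"key", $"value").count.orderBy($"count".desc).show()
// +---+-----+-----+
// |key|value|count|
// +---+-----+-----+
// |  a|    1|    2|
// |  b|    2|    1|
// +---+-----+-----+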