Something similar to Spark - Group by Key then Count by Value would allow me to emulate the functionality of Pandas' df.series.value_counts() in Spark:
The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default. (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html)
I am curious whether this can be achieved more nicely / simply for data frames in Spark.
By default, value_counts sorts the data by count in descending order. The ascending parameter lets you change this: when you set ascending=True, value_counts sorts the counts from low to high (i.e., ascending order).
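On the Spark side, the equivalent of ascending=True is simply flipping the sort direction. A minimal sketch, assuming the same df and spark.implicits._ as in the answer below:
df.groupBy($"value").count.orderBy($"count".asc)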
count() should be used when you want the number of valid (non-NA) values in a column, while value_counts() should be used to find the frequency of each value in a Series.
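In Spark terms the same distinction looks roughly like this (a sketch, assuming spark.implicits._ and org.apache.spark.sql.functions.count are imported and df has a "value" column):
// number of non-null entries in one column, roughly pandas count()
df.select(count($"value")).show()
// frequency of each distinct value, roughly pandas value_counts()
df.groupBy($"value").count.orderBy($"count".desc).show()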
It is just a basic aggregation, isn't it?
df.groupBy($"value").count.orderBy($"count".desc)
Pandas:
import pandas as pd
pd.Series([1, 2, 2, 2, 3, 3, 4]).value_counts()
2 3
3 2
4 1
1 1
dtype: int64
Spark SQL:
Seq(1, 2, 2, 2, 3, 3, 4).toDF("value")
.groupBy($"value").count.orderBy($"count".desc)
+-----+-----+
|value|count|
+-----+-----+
| 2| 3|
| 3| 2|
| 1| 1|
| 4| 1|
+-----+-----+
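If this pattern comes up often, it can be wrapped in a small helper. valueCounts is a hypothetical name, not a built-in Spark method; this is only a sketch:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.desc

// Emulates pandas value_counts for a single column: group, count, sort by frequency.
def valueCounts(df: DataFrame, colName: String): DataFrame =
  df.groupBy(colName).count.orderBy(desc("count"))

// e.g. valueCounts(Seq(1, 2, 2, 2, 3, 3, 4).toDF("value"), "value").show()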
If you want to include additional grouping columns (like "key"), just put these in the groupBy:
df.groupBy($"key", $"value").count.orderBy($"count".desc)
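For example, with a hypothetical two-column DataFrame (assuming spark.implicits._ is in scope):
val kv = Seq(("a", 1), ("a", 1), ("b", 2)).toDF("key", "value")
kv.groupBy($"key", $"value").count.orderBy($"count".desc).show()
// +---+-----+-----+
// |key|value|count|
// +---+-----+-----+
// |  a|    1|    2|
// |  b|    2|    1|
// +---+-----+-----+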