 

Spark simpler value_counts

Something similar to Spark - Group by Key then Count by Value would allow me to emulate the functionality of Pandas' df.series.value_counts() in Spark:

The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default. (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html)

I am curious whether this can be achieved more nicely / more simply for data frames in Spark.

asked Nov 21 '16 by Georg Heiler

People also ask

Is value_counts() sorted?

By default, value_counts will sort the data by numeric count in descending order. The ascending parameter enables you to change this. When you set ascending = True , value counts will sort the data by count from low to high (i.e., ascending order).
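
A minimal pandas sketch of this behavior (the data is made up for illustration):

import pandas as pd

s = pd.Series([1, 2, 2, 2, 3, 3, 4])

# Default: sorted by count in descending order (most frequent first)
s.value_counts()

# ascending=True sorts by count from low to high instead
s.value_counts(ascending=True)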

What is the difference between count() and value_counts() in pandas?

count() should be used when you want to find the number of valid (non-NA) values present in a column. value_counts() should be used to find the frequencies of the values of a series.
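
A quick sketch of the distinction, using a toy series with a missing value:

import pandas as pd

s = pd.Series([1, 2, 2, None])

# count(): number of non-NA entries in the series
s.count()          # 3

# value_counts(): frequency of each distinct value (NA excluded by default)
s.value_counts()   # 2.0 -> 2, 1.0 -> 1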

What is val in Spark SQL?

This is Scala semantics. A val is an immutable reference which gets evaluated once at the declaration site.


1 Answer

It is just a basic aggregation, isn't it?

df.groupBy($"value").count.orderBy($"count".desc)

Pandas:

import pandas as pd

pd.Series([1, 2, 2, 2, 3, 3, 4]).value_counts()
2    3
3    2
4    1
1    1
dtype: int64

Spark SQL:

Seq(1, 2, 2, 2, 3, 3, 4).toDF("value")
  .groupBy($"value").count.orderBy($"count".desc)
  .show()
+-----+-----+
|value|count|
+-----+-----+
|    2|    3|
|    3|    2|
|    1|    1|
|    4|    1|
+-----+-----+
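
For completeness, a PySpark sketch of the same aggregation (it assumes an active SparkSession bound to the name spark; the column name mirrors the Scala example):

from pyspark.sql import functions as F

# Build a one-column DataFrame from the same sample values
df = spark.createDataFrame([(v,) for v in [1, 2, 2, 2, 3, 3, 4]], ["value"])

# Group by value, count occurrences, and sort by count descending
df.groupBy("value").count().orderBy(F.col("count").desc()).show()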

If you want to include additional grouping columns (like "key"), just put them in the groupBy:

df.groupBy($"key", $"value").count.orderBy($"count".desc)
answered Sep 17 '22 by zero323