I'm trying to group by date in a Spark dataframe and for each group count the unique values of one column:
test.json {"name":"Yin", "address":1111111, "date":20151122045510} {"name":"Yin", "address":1111111, "date":20151122045501} {"name":"Yln", "address":1111111, "date":20151122045500} {"name":"Yun", "address":1111112, "date":20151122065832} {"name":"Yan", "address":1111113, "date":20160101003221} {"name":"Yin", "address":1111111, "date":20160703045231} {"name":"Yin", "address":1111114, "date":20150419134543} {"name":"Yen", "address":1111115, "date":20151123174302}
And the code:
```python
import pyspark.sql.functions as func
from pyspark.sql.types import TimestampType
from datetime import datetime

df_y = sqlContext.read.json("/user/test.json")
udf_dt = func.udf(lambda x: datetime.strptime(x, '%Y%m%d%H%M%S'), TimestampType())
df = df_y.withColumn('datetime', udf_dt(df_y.date))
df_g = df_y.groupby(func.hour(df_y.date))
df_g.count().distinct().show()
```
The results with PySpark are:
```python
df_y.groupby(df_y.name).count().distinct().show()
```

```
+----+-----+
|name|count|
+----+-----+
| Yan|    1|
| Yun|    1|
| Yin|    4|
| Yen|    1|
| Yln|    1|
+----+-----+
```
And what I'm expecting is something like this with pandas:
```python
df = df_y.toPandas()
df.groupby('name').address.nunique()
```

```
Out[51]:
name
Yan    1
Yen    1
Yin    2
Yln    1
Yun    1
```
How can I get the unique elements of each group by another field, like address?
In PySpark there are two ways to get a distinct count. One is to chain the DataFrame methods distinct() and count(); the other is to use the SQL aggregate function countDistinct(), which returns the number of distinct values across the selected columns.
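A minimal sketch of both approaches, assuming an active SparkSession named `spark` and a small made-up DataFrame (not part of the original post):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy data, loosely mirroring test.json above
df = spark.createDataFrame(
    [("Yin", 1111111), ("Yin", 1111114), ("Yan", 1111113)],
    ["name", "address"],
)

# Way 1: distinct() + count() counts the distinct rows of the selection
print(df.select("address").distinct().count())   # 3

# Way 2: the countDistinct() aggregate function
df.select(F.countDistinct("address")).show()
```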
To get distinct rows for a subset of columns, use the select() method to pick those columns, call distinct() on the result to return only the unique combinations, and finally call collect() to bring the resulting rows back to the driver.
distinct() returns a new DataFrame containing the distinct rows of this DataFrame. If you need to consider only a subset of the columns when dropping duplicates, first make a column selection before calling distinct(), as shown below.
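A small sketch of that select-then-distinct pattern, again on made-up data (the DataFrame and column names are assumptions for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Yin", 1111111), ("Yin", 1111111), ("Yin", 1111114)],
    ["name", "address"],
)

# Select the subset of columns first, drop duplicate combinations,
# then bring the resulting rows back to the driver with collect()
rows = df.select("name", "address").distinct().collect()
for row in rows:
    print(row["name"], row["address"])
```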
PySpark's distinct() drops duplicate rows considering all columns, while dropDuplicates() drops rows based on one or more selected columns. A minimal sketch of the difference follows.
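Again on hypothetical toy data (not from the original post):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Yin", 1111111), ("Yin", 1111111), ("Yin", 1111114)],
    ["name", "address"],
)

# distinct() deduplicates on all columns: two rows remain,
# ("Yin", 1111111) and ("Yin", 1111114)
df.distinct().show()

# dropDuplicates() deduplicates on the listed columns only:
# one (arbitrary) row per name remains
df.dropDuplicates(["name"]).show()
```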
There's a way to count the distinct elements of each group using the function countDistinct:
```python
import pyspark.sql.functions as func
from pyspark.sql.types import TimestampType
from datetime import datetime

df_y = sqlContext.read.json("/user/test.json")
udf_dt = func.udf(lambda x: datetime.strptime(x, '%Y%m%d%H%M%S'), TimestampType())
df = df_y.withColumn('datetime', udf_dt(df_y.date))
df_g = df_y.groupby(func.hour(df_y.date))
df_y.groupby(df_y.name).agg(func.countDistinct('address')).show()
```

```
+----+--------------+
|name|count(address)|
+----+--------------+
| Yan|             1|
| Yun|             1|
| Yin|             2|
| Yen|             1|
| Yln|             1|
+----+--------------+
```
The docs are available [here](https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/functions.html#countDistinct(org.apache.spark.sql.Column, org.apache.spark.sql.Column...)).
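As the linked signature suggests, countDistinct() also accepts several columns, in which case it counts distinct combinations of their values. A small sketch, reusing df_y from the snippet above (the column choice here is just illustrative):

```python
# Counts distinct (address, date) pairs within each name group
df_y.groupby(df_y.name).agg(func.countDistinct('address', 'date')).show()
```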
A concise and direct answer: to group by a field "_c1" and count the distinct values of field "_c2":
```python
import pyspark.sql.functions as F

dg = df.groupBy("_c1").agg(F.countDistinct("_c2"))
```
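If the default column name (e.g. `count(_c2)`) is awkward downstream, aliasing the aggregate gives it a friendlier name; a small usage sketch (the alias name is just an example):

```python
import pyspark.sql.functions as F

dg = df.groupBy("_c1").agg(F.countDistinct("_c2").alias("distinct_c2"))
dg.show()
```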