 

Count the distinct elements of each group by another field on a Spark 1.6 DataFrame

I'm trying to group by date in a Spark dataframe and for each group count the unique values of one column:

test.json:

```json
{"name":"Yin", "address":1111111, "date":20151122045510}
{"name":"Yin", "address":1111111, "date":20151122045501}
{"name":"Yln", "address":1111111, "date":20151122045500}
{"name":"Yun", "address":1111112, "date":20151122065832}
{"name":"Yan", "address":1111113, "date":20160101003221}
{"name":"Yin", "address":1111111, "date":20160703045231}
{"name":"Yin", "address":1111114, "date":20150419134543}
{"name":"Yen", "address":1111115, "date":20151123174302}
```

And the code:

```python
import pyspark.sql.functions as func
from pyspark.sql.types import TimestampType
from datetime import datetime

df_y = sqlContext.read.json("/user/test.json")
udf_dt = func.udf(lambda x: datetime.strptime(x, '%Y%m%d%H%M%S'), TimestampType())
df = df_y.withColumn('datetime', udf_dt(df_y.date))
df_g = df_y.groupby(func.hour(df_y.date))
df_g.count().distinct().show()
```

The results with pyspark are:

```python
df_y.groupby(df_y.name).count().distinct().show()
+----+-----+
|name|count|
+----+-----+
| Yan|    1|
| Yun|    1|
| Yin|    4|
| Yen|    1|
| Yln|    1|
+----+-----+
```

And what I'm expecting is something like this with pandas:

```python
df = df_y.toPandas()
df.groupby('name').address.nunique()
Out[51]:
name
Yan    1
Yen    1
Yin    2
Yln    1
Yun    1
```

How can I get the unique elements of each group by another field, like address?

asked Mar 17 '16 by Ivan

People also ask

How do I count distinct values in spark DataFrame?

In PySpark, there are two ways to get the count of distinct values. We can use the distinct() and count() functions of DataFrame to get the distinct count of a PySpark DataFrame. Another way is to use the countDistinct() function, which provides the distinct value count of all the selected columns.
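The distinction matters for the question above: `count()` after a group-by counts rows per group, while `countDistinct()` counts unique values per group. A pure-Python sketch of those semantics (not the Spark API), using the `(name, address)` pairs from the question's test.json:

```python
from collections import defaultdict

# (name, address) pairs from the question's test.json.
rows = [
    ("Yin", 1111111), ("Yin", 1111111), ("Yln", 1111111),
    ("Yun", 1111112), ("Yan", 1111113), ("Yin", 1111111),
    ("Yin", 1111114), ("Yen", 1111115),
]

# groupby(name).count(): number of rows per name (what the question got).
row_counts = defaultdict(int)
for name, _ in rows:
    row_counts[name] += 1

# countDistinct('address') per name: size of the set of addresses (what was wanted).
addresses_by_name = defaultdict(set)
for name, address in rows:
    addresses_by_name[name].add(address)
distinct_counts = {name: len(addrs) for name, addrs in addresses_by_name.items()}

print(row_counts["Yin"])       # 4 rows for Yin...
print(distinct_counts["Yin"])  # ...but only 2 distinct addresses
```

This reproduces both tables from the question: Yin appears in 4 rows but has only 2 distinct addresses.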

How do you select distinct in PySpark?

We will use the select() method to get the distinct rows from the selected columns: select() chooses the columns, then distinct() returns the unique values from those columns, and finally collect() retrieves the rows returned by the ...

What does distinct do in spark?

distinct() returns a new DataFrame containing the distinct rows of this DataFrame. If you need to consider only a subset of the columns when dropping duplicates, you first have to make a column selection before calling distinct().
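The effect of selecting columns before deduplicating can be sketched in pure Python (not the Spark API): distinct() over full rows only collapses rows that are identical in every column, whereas selecting a column first deduplicates on that column alone.

```python
rows = [
    ("Yin", 1111111), ("Yin", 1111111), ("Yin", 1111114),
]

# distinct() over full rows: the two identical ("Yin", 1111111) rows collapse,
# but ("Yin", 1111114) survives because the address differs.
distinct_rows = list(dict.fromkeys(rows))

# select("name") first, then distinct(): only the name column is considered.
distinct_names = list(dict.fromkeys(name for name, _ in rows))

print(distinct_rows)   # [('Yin', 1111111), ('Yin', 1111114)]
print(distinct_names)  # ['Yin']
```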

How does PySpark distinct work?

PySpark distinct() function is used to drop/remove the duplicate rows (all columns) from DataFrame and dropDuplicates() is used to drop rows based on selected (one or multiple) columns. In this article, you will learn how to use distinct() and dropDuplicates() functions with PySpark example.
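The difference can be sketched in pure Python (not the actual Spark API): dropDuplicates() with a column subset keeps the whole first row seen for each key, while distinct() only drops rows that match in every column.

```python
rows = [
    ("Yin", 1111111, 20151122045510),
    ("Yin", 1111111, 20151122045501),   # same name/address, different date
    ("Yin", 1111114, 20150419134543),
]

# distinct(): the rows differ in the date column, so nothing is dropped.
distinct_all = list(dict.fromkeys(rows))

# dropDuplicates(["name", "address"]): keep the first full row per (name, address).
seen = {}
for row in rows:
    seen.setdefault((row[0], row[1]), row)
dedup_by_key = list(seen.values())

print(len(distinct_all))   # 3
print(len(dedup_by_key))   # 2
```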


2 Answers

You can count the distinct elements of each group using the function countDistinct:

```python
import pyspark.sql.functions as func
from pyspark.sql.types import TimestampType
from datetime import datetime

df_y = sqlContext.read.json("/user/test.json")
udf_dt = func.udf(lambda x: datetime.strptime(x, '%Y%m%d%H%M%S'), TimestampType())
df = df_y.withColumn('datetime', udf_dt(df_y.date))

df_y.groupby(df_y.name).agg(func.countDistinct('address')).show()

+----+--------------+
|name|count(address)|
+----+--------------+
| Yan|             1|
| Yun|             1|
| Yin|             2|
| Yen|             1|
| Yln|             1|
+----+--------------+
```

The docs are available [here](https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/functions.html#countDistinct(org.apache.spark.sql.Column, org.apache.spark.sql.Column...)).

answered Sep 30 '22 by Ivan

A concise and direct answer: group by field "_c1" and count the distinct values of field "_c2":

```python
import pyspark.sql.functions as F

dg = df.groupBy("_c1").agg(F.countDistinct("_c2"))
```
answered Sep 30 '22 by Quetzalcoatl