I'm trying to group by date in a Spark dataframe and for each group count the unique values of one column:
test.json {"name":"Yin", "address":1111111, "date":20151122045510} {"name":"Yin", "address":1111111, "date":20151122045501} {"name":"Yln", "address":1111111, "date":20151122045500} {"name":"Yun", "address":1111112, "date":20151122065832} {"name":"Yan", "address":1111113, "date":20160101003221} {"name":"Yin", "address":1111111, "date":20160703045231} {"name":"Yin", "address":1111114, "date":20150419134543} {"name":"Yen", "address":1111115, "date":20151123174302}
And the code:
```python
import pyspark.sql.functions as func
from pyspark.sql.types import TimestampType
from datetime import datetime

df_y = sqlContext.read.json("/user/test.json")
udf_dt = func.udf(lambda x: datetime.strptime(x, '%Y%m%d%H%M%S'), TimestampType())
df = df_y.withColumn('datetime', udf_dt(df_y.date))
df_g = df_y.groupby(func.hour(df_y.date))
df_g.count().distinct().show()
```
The results with PySpark are:
```python
df_y.groupby(df_y.name).count().distinct().show()
```

```
+----+-----+
|name|count|
+----+-----+
| Yan|    1|
| Yun|    1|
| Yin|    4|
| Yen|    1|
| Yln|    1|
+----+-----+
```
And what I'm expecting is something like this with pandas:
```python
df = df_y.toPandas()
df.groupby('name').address.nunique()
```

```
Out[51]:
name
Yan    1
Yen    1
Yin    2
Yln    1
Yun    1
```
How can I get the unique elements of each group by another field, like address?
In PySpark there are two ways to get a distinct count. One is to chain the DataFrame methods distinct() and count(); the other is to use the SQL aggregate function countDistinct(), which returns the number of distinct values across the selected columns.
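A minimal sketch of both approaches, assuming an active SparkSession named `spark` and a small made-up DataFrame (not part of the original post):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy data, loosely mirroring test.json above
df = spark.createDataFrame(
    [("Yin", 1111111), ("Yin", 1111114), ("Yan", 1111113)],
    ["name", "address"],
)

# Way 1: distinct() + count() counts the distinct rows of the selection
print(df.select("address").distinct().count())   # 3

# Way 2: the countDistinct() aggregate function
df.select(F.countDistinct("address")).show()
```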
To get distinct rows for a subset of columns, use the select() method to pick those columns, call distinct() on the result to return only the unique combinations, and finally call collect() to bring the resulting rows back to the driver.
distinct() returns a new DataFrame containing the distinct rows of this DataFrame. If you need to consider only a subset of the columns when dropping duplicates, first make a column selection before calling distinct(), as shown below.
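A small sketch of that select-then-distinct pattern, again on made-up data (the DataFrame and column names are assumptions for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Yin", 1111111), ("Yin", 1111111), ("Yin", 1111114)],
    ["name", "address"],
)

# Select the subset of columns first, drop duplicate combinations,
# then bring the resulting rows back to the driver with collect()
rows = df.select("name", "address").distinct().collect()
for row in rows:
    print(row["name"], row["address"])
```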
PySpark's distinct() drops duplicate rows considering all columns, while dropDuplicates() drops rows based on one or more selected columns. A minimal sketch of the difference follows.
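Again on hypothetical toy data (not from the original post):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Yin", 1111111), ("Yin", 1111111), ("Yin", 1111114)],
    ["name", "address"],
)

# distinct() deduplicates on all columns: two rows remain,
# ("Yin", 1111111) and ("Yin", 1111114)
df.distinct().show()

# dropDuplicates() deduplicates on the listed columns only:
# one (arbitrary) row per name remains
df.dropDuplicates(["name"]).show()
```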
There's a way to count the distinct elements of each group using the function countDistinct:
```python
import pyspark.sql.functions as func
from pyspark.sql.types import TimestampType
from datetime import datetime

df_y = sqlContext.read.json("/user/test.json")
udf_dt = func.udf(lambda x: datetime.strptime(x, '%Y%m%d%H%M%S'), TimestampType())
df = df_y.withColumn('datetime', udf_dt(df_y.date))
df_g = df_y.groupby(func.hour(df_y.date))
df_y.groupby(df_y.name).agg(func.countDistinct('address')).show()
```

```
+----+--------------+
|name|count(address)|
+----+--------------+
| Yan|             1|
| Yun|             1|
| Yin|             2|
| Yen|             1|
| Yln|             1|
+----+--------------+
```
The docs are available [here](https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/functions.html#countDistinct(org.apache.spark.sql.Column, org.apache.spark.sql.Column...)).
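As the linked signature suggests, countDistinct() also accepts several columns, in which case it counts distinct combinations of their values. A small sketch, reusing df_y from the snippet above (the column choice here is just illustrative):

```python
# Counts distinct (address, date) pairs within each name group
df_y.groupby(df_y.name).agg(func.countDistinct('address', 'date')).show()
```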
A concise and direct answer: to group by a field "_c1" and count the distinct values of field "_c2":
```python
import pyspark.sql.functions as F

dg = df.groupBy("_c1").agg(F.countDistinct("_c2"))
```
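If the default column name (e.g. `count(_c2)`) is awkward downstream, aliasing the aggregate gives it a friendlier name; a small usage sketch (the alias name is just an example):

```python
import pyspark.sql.functions as F

dg = df.groupBy("_c1").agg(F.countDistinct("_c2").alias("distinct_c2"))
dg.show()
```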