I'm trying to compare different ways to aggregate my data.
This is my input data, where each element is a (page, visitor) pair:

    (PAG1,V1) (PAG1,V1) (PAG2,V1) (PAG2,V2) (PAG2,V1) (PAG1,V1) (PAG1,V2) (PAG1,V1) (PAG1,V2) (PAG1,V1) (PAG2,V2) (PAG1,V3)
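For reference, a minimal sketch of how this input can be built, assuming data is an RDD[(String, String)] and sc is the SparkContext:

    // Sketch: the sample (page, visitor) pairs as an RDD
    val data = sc.parallelize(Seq(
      ("PAG1", "V1"), ("PAG1", "V1"), ("PAG2", "V1"), ("PAG2", "V2"),
      ("PAG2", "V1"), ("PAG1", "V1"), ("PAG1", "V2"), ("PAG1", "V1"),
      ("PAG1", "V2"), ("PAG1", "V1"), ("PAG2", "V2"), ("PAG1", "V3")
    ))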
Using a SQL query in Spark SQL, with this code:

    import sqlContext.implicits._

    case class Log(page: String, visitor: String)

    val logs = data.map(p => Log(p._1, p._2)).toDF()
    logs.registerTempTable("logs")

    val sqlResult = sqlContext.sql(
      """select page, count(distinct visitor) as visitor
         from logs
         group by page""")
    val result = sqlResult.map(x => (x(0).toString, x(1).toString))
    result.foreach(println)
I get this output:
    (PAG1,3)  // PAG1 has been visited by 3 different visitors
    (PAG2,2)  // PAG2 has been visited by 2 different visitors
Now I would like to get the same result using DataFrames and their API, but I can't get the same output:
    import sqlContext.implicits._

    case class Log(page: String, visitor: String)

    val logs = data.map(p => Log(p._1, p._2)).toDF()
    val result = logs.select("page", "visitor")
      .groupBy("page")
      .count()
      .distinct
    result.foreach(println)
This is the output I get instead:
    [PAG1,8]  // just the simple page count for every page
    [PAG2,4]
Your version computes the plain row count per page: groupBy("page").count() returns the number of log entries for each page, and the trailing distinct only de-duplicates the already aggregated (page, count) rows, so it changes nothing. What you need is the DataFrame aggregation function countDistinct:
    import sqlContext.implicits._
    import org.apache.spark.sql.functions._

    case class Log(page: String, visitor: String)

    val logs = data.map(p => Log(p._1, p._2)).toDF()

    val result = logs.select("page", "visitor")
      .groupBy('page)
      .agg('page, countDistinct('visitor))

    result.foreach(println)
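If you prefer string-based column references, here is a minimal sketch of two equivalent formulations (assuming Spark 1.3+; the names result2 and result3 are just for illustration):

    import org.apache.spark.sql.functions.countDistinct

    // Equivalent: count distinct visitors per page, keeping the page column
    val result2 = logs.groupBy("page")
      .agg(countDistinct("visitor").alias("visitor"))

    // Alternative: de-duplicate the (page, visitor) pairs first,
    // then a plain row count per page gives the same numbers
    val result3 = logs.select("page", "visitor")
      .distinct()
      .groupBy("page")
      .count()

Both should produce (PAG1,3) and (PAG2,2) on the sample data above.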