100 million customers click 100 billion times on the pages of a few websites (say, 100 websites). The click stream is available to you in a large dataset. Using the abstractions of Apache Spark, what is the most efficient way to count distinct visitors per website?


Efficient Count Distinct with Apache Spark

1 Answer

visitors.distinct().count() would be the obvious way. distinct also accepts a number of partitions, so you can raise the level of parallelism and may see an improvement in speed. If it is possible to set up visitors as a stream and use DStreams, the count can be produced in near real time, once per batch interval. You can stream directly from a directory. A DStream has no distinct method of its own, so apply the RDD method through transform, like:

val file = ssc.textFileStream("...")
file.transform(_.distinct()).count().print()  // distinct count within each batch

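The last option is to use def countApproxDistinct(relativeSD: Double = 0.05): Long. This is labelled as experimental (at least at the time of writing), but it can be significantly faster than an exact count. relativeSD is the relative standard deviation of the estimate: a higher value costs accuracy but needs less memory and work.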
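
As a rough illustration of the two RDD options above, here is a minimal sketch that runs both on a small made-up dataset. The object name, the sample ids, the partition count of 8 and the local master are all assumptions for the example, not part of the question.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object DistinctVisitors {
  // Exact count: deduplicates with a shuffle; numPartitions sets the
  // parallelism of that shuffle.
  def exactCount(visitors: RDD[String], numPartitions: Int): Long =
    visitors.distinct(numPartitions).count()

  // Approximate count: each partition builds a small sketch and the
  // sketches are merged, so no full shuffle of the ids is needed.
  def approxCount(visitors: RDD[String], relativeSD: Double): Long =
    visitors.countApproxDistinct(relativeSD)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DistinctVisitors")
      .master("local[*]") // assumption: local run for the example
      .getOrCreate()

    // Hypothetical sample: 100,000 clicks from 1,000 distinct visitor ids.
    val visitors = spark.sparkContext
      .parallelize(0 until 100000)
      .map(i => s"user-${i % 1000}")

    println(s"exact:  ${exactCount(visitors, 8)}")
    println(s"approx: ${approxCount(visitors, 0.05)}")

    spark.stop()
  }
}
```

The approximate result will usually differ a little from the exact one. Lowering relativeSD tightens the estimate at the cost of more memory per partition.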

EDIT: Since you want the count per website, you can reduce on the website id. This can be done efficiently (with map-side combiners) since a count is an aggregate. If you have an RDD of (website name, user id) tuples, you can get an exact count with visitors.distinct().countByKey(); as far as I know there is no countDistinctByKey() on RDDs, so the pairs have to be deduplicated first. For an estimate, use visitors.countApproxDistinctByKey(); once again, the approximate one is labelled experimental. Both need a pair RDD, that is, an RDD of key-value tuples.
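
The per-website variant can be sketched the same way. This is only an illustration under assumptions: the object name, the sample (site, visitor) pairs and the local master are invented for the example. The exact version uses reduceByKey rather than countByKey so the result stays distributed until it is collected.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object DistinctVisitorsPerSite {
  // Exact: drop duplicate (site, visitor) pairs, then count what is left
  // per site. reduceByKey combines partial counts on the map side.
  def exactPerSite(clicks: RDD[(String, String)]): Map[String, Long] =
    clicks.distinct()
      .mapValues(_ => 1L)
      .reduceByKey(_ + _)
      .collect()
      .toMap

  // Approximate: one sketch per site, merged across partitions.
  def approxPerSite(clicks: RDD[(String, String)],
                    relativeSD: Double): Map[String, Long] =
    clicks.countApproxDistinctByKey(relativeSD).collect().toMap

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DistinctVisitorsPerSite")
      .master("local[*]") // assumption: local run for the example
      .getOrCreate()

    // Hypothetical click stream of (website, visitor id) pairs.
    val clicks = spark.sparkContext.parallelize(Seq(
      ("a.com", "u1"), ("a.com", "u1"), ("a.com", "u2"), ("b.com", "u1")
    ))

    println(exactPerSite(clicks))        // a.com has 2 distinct visitors, b.com has 1
    println(approxPerSite(clicks, 0.05)) // estimates, may deviate slightly

    spark.stop()
  }
}
```

Collecting to the driver is reasonable here because the question has only about 100 websites, so the result map is tiny.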

Interesting side note: if you are OK with approximations and want fast results, you might want to look into BlinkDB, made by the same people as Spark (the AMPLab at UC Berkeley).

147

answered Sep 23 '22 02:09

aaronman


Tags:

distinct

apache-spark

Antoine CHAMBILLE
