100 million customers click 100 billion times on the pages of a few web sites (let's say 100 websites). And the click stream is available to you in a large dataset.
Using the abstractions of Apache Spark, what is the most efficient way to count distinct visitors per website?
In Pyspark, there are two ways to get the count of distinct values. We can use distinct() and count() functions of DataFrame to get the count distinct of PySpark DataFrame. Another way is to use SQL countDistinct() function which will provide the distinct value count of all the selected columns.
For counting the number of distinct rows we are using distinct(). count() function which extracts the number of distinct rows from the Dataframe and storing it in the variable named as 'row' For counting the number of columns we are using df.
It's slow because the database is iterating over all the logs and all the dashboards, then joining them, then sorting them, all before getting down to real work of grouping and aggregating.
You can use the nunique() function to count the number of unique values in a pandas DataFrame.
visitors.distinct().count()
would be the obvious ways, with the first way in distinct you can specify the level of parallelism and also see improvement in the speed. If it is possible to set up visitors as a stream and use D-streams, that would do the count in realtime. You can stream directly from a directory and use the same methods as on the RDD like:
val file = ssc.textFileStream("...") file.distinct().count()
Last option is to use def countApproxDistinct(relativeSD: Double = 0.05): Long
however this is labelled as experimental, but would be significantly faster than count if relativeSD
(std deviation) is higher.
EDIT: Since you want the count per website you can just reduce on the website id, this can be done efficiently (with combiners ) since count is aggregate. If you have an RDD of website name user id tuples you can do. visitors.countDistinctByKey()
or visitors.countApproxDistinctByKey()
, once again the approx one is experimental. To use approx distinct by key you need a PairRDD
Interesting side note if you are ok with approximations and want fast results you might want to look into blinkDB made by the same people as spark amp labs.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With