Question about visualization methods for Spark DataFrames.
As of now (I am using v2.0.0), Spark DataFrames do not have any built-in visualization functionality (yet). The usual workaround is to collect a sample of the DataFrame into the driver, load it into, for instance, a Pandas DataFrame, and use its visualization capabilities.
My question is: how do I determine the optimal sample size so that I make the most of the driver's memory when visualizing the data? And what is the best practice for working around this issue?
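For reference, this is the kind of pattern I mean (a minimal sketch; df, the 10% fraction, and the 10,000-row cap are placeholder assumptions, not values I know to be right):

sample_pdf = (
    df.sample(False, 0.1, seed=42)  # ~10% sample, computed on the cluster
      .limit(10000)                 # hard cap on rows pulled to the driver
      .toPandas()                   # collect into a Pandas DataFrame
)
sample_pdf.plot()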
Thanks!
I don't think this will answer your question directly, but hopefully it will give some perspective to you or others.
I usually aggregate in Spark and then use Pandas to visualize the result. For example (simplified), I would count active users per day, collect only that count, and visualize it with Pandas; when possible, I try to avoid saving the collected data to a variable:
from pyspark.sql import functions as F

(
    spark.table("table_name")
    .filter(F.col("status") == "Active")  # keep only active users
    .groupBy("dt")
    .count()                              # one count per day
    .toPandas()                           # collect only the small aggregate
    .plot(x="dt", y="count")
)
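The point is that the groupBy reduces the data to at most one row per day before anything leaves the cluster, so toPandas() is safe regardless of the source table's size. If you want to confirm the collected result will be small, a quick sanity check (my addition, reusing the same hypothetical table) is to count the aggregate first:

daily = (
    spark.table("table_name")
    .filter(F.col("status") == "Active")
    .groupBy("dt")
    .count()
)
print(daily.count())  # number of rows toPandas() would bring to the driver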