
Visualization of data from dataframe in (Py)Spark framework

A question about methods for visualizing Spark DataFrames.

As of now (I use v2.0.0), Spark DataFrames do not have any visualization functionality. The usual solution is to collect a sample of the DataFrame into the driver, load it into, for instance, a Pandas DataFrame, and use its visualization capabilities.

My question is: how do I determine the optimal sample size, one that makes the most of the driver's memory, in order to visualize the data? Or, what is the best practice for working around this issue?

Thanks!

Mike asked Oct 30 '22 12:10



1 Answer

I don't think this will answer your question, but hopefully, it will give some perspective for others, or maybe you.

I usually aggregate in Spark and then use Pandas to visualize the result, without storing it in a variable when I can avoid it. For example (simplified), I would count active users per day, then collect only that count and plot it through Pandas:

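from pyspark.sql import functions as F

# Assumes an active SparkSession named `spark`, as in the pyspark shell.
(
    spark.table("table_name")
    .filter(F.col("status") == "Active")  # keep only active users
    .groupBy("dt")
    .count()  # one row per day, so the collected result stays small
    .toPandas()
    .plot(x="dt", y="count")
)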
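
For the sampling route described in the question, one possible approach is to measure the per-row cost of a small pilot slice in Pandas and scale the sampling fraction to a memory budget. The following is only a rough sketch under assumptions: `sample_fraction`, `sample_for_plotting`, `budget_bytes` and `pilot_rows` are hypothetical names, `df` is assumed to be a `pyspark.sql.DataFrame`, and the budget should be chosen well below the driver's memory (collected results are also capped by `spark.driver.maxResultSize`).

```python
def sample_fraction(total_rows, bytes_per_row, budget_bytes):
    """Fraction of rows whose estimated Pandas footprint fits in budget_bytes."""
    if total_rows <= 0 or bytes_per_row <= 0:
        return 1.0  # nothing to scale down
    return min(1.0, budget_bytes / (total_rows * bytes_per_row))


def sample_for_plotting(df, budget_bytes, pilot_rows=1000, seed=42):
    """Collect a random sample of a Spark DataFrame sized to a memory budget.

    `df` is assumed to be a pyspark.sql.DataFrame; `budget_bytes` is a
    hypothetical budget picked by the caller.
    """
    # Measure the per-row cost in Pandas on a small pilot slice.
    # limit() takes the first rows, not random ones, so this is only an estimate.
    pilot = df.limit(pilot_rows).toPandas()
    if len(pilot) == 0:
        return pilot
    bytes_per_row = pilot.memory_usage(deep=True).sum() / len(pilot)

    fraction = sample_fraction(df.count(), bytes_per_row, budget_bytes)
    # sample() is approximate: the returned row count varies around the target.
    return df.sample(withReplacement=False, fraction=fraction, seed=seed).toPandas()
```

Something like `sample_for_plotting(df, 200 * 1024**2).plot(...)` would then aim for roughly 200 MB in Pandas. Plotting itself needs extra memory on top of the collected frame, so a conservative budget is probably safer than one that tries to fill the driver.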
Paulius Baranauskas answered Nov 15 '22 10:11
