Question about visualization methods for Spark DataFrames.
As of now (I am using v2.0.0), Spark DataFrames do not have any built-in visualization functionality (yet). The usual workaround is to collect a sample of the DataFrame into the driver, load it into, for instance, a Pandas DataFrame, and use its visualization capabilities.
My question is: how do I determine the optimal sample size so that I make the most of the driver's memory when visualizing the data? And what is the best practice for working around this issue?
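For reference, this is the kind of pattern I mean (a minimal sketch; df, the 10% fraction, and the 10,000-row cap are placeholder assumptions, not values I know to be right):

sample_pdf = (
    df.sample(False, 0.1, seed=42)  # ~10% sample, computed on the cluster
      .limit(10000)                 # hard cap on rows pulled to the driver
      .toPandas()                   # collect into a Pandas DataFrame
)
sample_pdf.plot()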
Thanks!
I don't think this will answer your question directly, but hopefully it will give some perspective to you or others.
I usually aggregate in Spark and then use Pandas to visualize the result. For example (simplified), I would count active users per day, collect only that count, and visualize it with Pandas; when possible, I try to avoid saving the collected data to a variable:
from pyspark.sql import functions as F

(
    spark.table("table_name")
    .filter(F.col("status") == "Active")  # keep only active users
    .groupBy("dt")
    .count()                              # one count per day
    .toPandas()                           # collect only the small aggregate
    .plot(x="dt", y="count")
)
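The point is that the groupBy reduces the data to at most one row per day before anything leaves the cluster, so toPandas() is safe regardless of the source table's size. If you want to confirm the collected result will be small, a quick sanity check (my addition, reusing the same hypothetical table) is to count the aggregate first:

daily = (
    spark.table("table_name")
    .filter(F.col("status") == "Active")
    .groupBy("dt")
    .count()
)
print(daily.count())  # number of rows toPandas() would bring to the driver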