
What happens when a Spark DataFrame is converted to a pandas DataFrame using the toPandas() method? [duplicate]

I have a Spark DataFrame that I can convert to a pandas DataFrame using the toPandas() method available in PySpark.

I have the following questions about this:

  1. Does this conversion defeat the purpose of using Spark in the first place (distributed computing)?
  2. The dataset is going to be huge, so what about speed and memory issues?
  3. If somebody could also explain what exactly happens with this one line of code, that would really help.

Thanks

asked by function, Oct 20 '25 05:10


1 Answer

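Yes. Once toPandas() is called on a Spark DataFrame, the data leaves the distributed system: every row is collected to the driver, and the new pandas DataFrame lives entirely in the memory of the cluster's driver node. Any further work on it runs on that single machine.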

If the Spark DataFrame is huge and does not fit into driver memory, the conversion will fail, typically with an out-of-memory error on the driver.
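To make the "everything ends up on the driver" point concrete, here is a rough toy model in plain Python. It is only a sketch and not how Spark is implemented: lists stand in for partitions, a row count stands in for driver memory, and `to_single_node`, `DriverOutOfMemory` and `driver_row_budget` are made-up names for illustration, not PySpark APIs.

```python
# Toy model only: plain Python lists stand in for Spark partitions and the driver.
# None of these names are PySpark APIs.


class DriverOutOfMemory(MemoryError):
    """Raised when the collected rows exceed the simulated driver budget."""


def to_single_node(partitions, driver_row_budget):
    """Gather every row from every partition into one list on the 'driver'."""
    collected = []
    for partition in partitions:
        for row in partition:
            # The driver has a fixed capacity; once it is full, collection fails.
            if len(collected) >= driver_row_budget:
                raise DriverOutOfMemory(
                    f"driver can hold only {driver_row_budget} rows"
                )
            collected.append(row)
    return collected


# Three "executors", each holding one partition of (id, value) rows.
partitions = [
    [(0, "a"), (1, "b")],
    [(2, "c"), (3, "d")],
    [(4, "e")],
]

# Small data: everything fits, but the result now lives in one place.
local_rows = to_single_node(partitions, driver_row_budget=10)
print(len(local_rows))  # 5

# Data too large for the driver: the collection fails part-way through.
try:
    to_single_node(partitions, driver_row_budget=3)
    fits = True
except DriverOutOfMemory:
    fits = False
print(fits)  # False
```

In real PySpark the usual way around this is to shrink the data while it is still distributed, for example with a filter, an aggregation, or `df.limit(n)`, and only then call `toPandas()` on the much smaller result.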

answered by WoodChopper, Oct 21 '25 19:10


