
How does Spark DataFrame handle a Pandas DataFrame that is larger than memory?

I am learning Spark now, and it seems to be the big-data solution for the Pandas DataFrame, but I have a question which makes me unsure.

Currently I store Pandas DataFrames that are larger than memory using HDF5. HDF5 is a great tool which lets me chunk the DataFrame, so when I need to process a large Pandas DataFrame, I do it in chunks. But Pandas does not support distributed processing, and HDF5 is only for a single-PC environment.
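For reference, the chunked workflow I mean looks roughly like this (the file name and key are just illustrative; `format="table"` is what makes chunked reads possible):

```python
import numpy as np
import pandas as pd

# Write a table-format HDF5 file; format="table" enables chunked reads.
df = pd.DataFrame({"x": np.arange(1_000), "y": np.arange(1_000) * 2})
df.to_hdf("data.h5", key="df", format="table", mode="w")

# Process the file chunk by chunk, so only one chunk is in memory at a time.
total = 0
for chunk in pd.read_hdf("data.h5", key="df", chunksize=250):
    total += int(chunk["x"].sum())

print(total)  # 499500, i.e. sum(range(1000))
```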

Using a Spark DataFrame may be the solution, but my understanding of Spark is that the DataFrame must be able to fit in memory, and that once loaded as a Spark DataFrame, Spark will distribute it to the different workers to do the distributed processing.

Is my understanding correct? If so, how does Spark handle a DataFrame that is larger than memory? Does it support chunking, like HDF5?

Michael asked Oct 29 '15

1 Answer

the dataframe must be able to fit in memory, and once loaded as a Spark dataframe, Spark will distribute the dataframe to the different workers to do the distributed processing.

This is true only if you're trying to load your data on the driver and then parallelize it. In a typical scenario you store the data in a format which can be read in parallel. This means your data:

  • has to be accessible on each worker, for example via a distributed file system
  • has to be in a file format that supports splitting (the simplest example is plain old CSV)

In a situation like this, each worker reads only its own part of the dataset, without any need to store data in the driver's memory. All the logic related to computing splits is handled transparently by the applicable Hadoop InputFormat.

Regarding HDF5 files, you have two options:

  • read the data in chunks on the driver, build a Spark DataFrame from each chunk, and union the results. This is inefficient, but easy to implement
  • distribute the HDF5 file(s) and read the data directly on the workers. This is, generally speaking, harder to implement and requires a smart data-distribution strategy
zero323 answered Sep 28 '22