I am learning Spark now, and it seems to be the big-data solution for Pandas DataFrames, but I have a question that makes me unsure.
Currently I am storing Pandas DataFrames that are larger than memory using HDF5. HDF5 is a great tool that lets me chunk the DataFrame, so when I need to process a large DataFrame, I do it in chunks. But Pandas does not support distributed processing, and HDF5 is only for a single-machine environment.
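For concreteness, my current workflow looks roughly like this (the file name, key, and chunk size are just placeholders; the store is written in table format so it can be read in chunks):

```python
import pandas as pd

# Iterate over the HDF5 store in fixed-size chunks; each chunk is an
# ordinary in-memory DataFrame, so any pandas operation works on it.
total_rows = 0
for chunk in pd.read_hdf("data.h5", key="df", chunksize=100_000):
    total_rows += len(chunk)

print(total_rows)
```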
Using a Spark DataFrame may be the solution, but my understanding of Spark is that the DataFrame must be able to fit in memory, and once loaded as a Spark DataFrame, Spark will distribute it to the different workers to do the distributed processing.
Is my understanding correct? If this is the case, how does Spark handle a DataFrame that is larger than memory? Does it support chunking, like HDF5?
the DataFrame must be able to fit in memory, and once loaded as a Spark DataFrame, Spark will distribute it to the different workers to do the distributed processing.
This is true only if you're trying to load your data on the driver and then parallelize it. In a typical scenario you store the data in a format that can be read in parallel, which means it has to be accessible from every worker (for example on a distributed file system such as HDFS or S3) and stored in a splittable format (plain CSV or Parquet, for example).
In a situation like this, each worker reads only its own part of the dataset, without any need to hold the data in the driver's memory. All the logic related to computing splits is handled transparently by the applicable Hadoop InputFormat.
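A minimal sketch of that scenario (the path, column name, and CSV options below are placeholders, not anything from your setup):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("larger-than-memory").getOrCreate()

# Each executor reads only the file splits assigned to it; the full
# dataset is never collected into the driver's memory.
df = spark.read.csv("hdfs:///data/events/*.csv", header=True, inferSchema=True)

# Transformations also run distributed across the workers.
df.groupBy("user_id").count().show()
```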
Regarding HDF5 files you have two options: read the data in chunks on the driver with Pandas, convert each chunk to a Spark DataFrame and union the results (each chunk still has to fit in driver memory), or convert the data once to a format that can be read in parallel (plain CSV or Parquet, for example) and load it directly with Spark.
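A rough sketch of the first option (file name, key, chunk size, and output path are placeholders; note that each chunk still passes through the driver, which is why converting once to a splittable format, as in the last line, is usually the better long-term fix):

```python
from functools import reduce

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the HDF5 store chunk by chunk on the driver and convert each
# chunk into a small Spark DataFrame.
chunks = (spark.createDataFrame(chunk)
          for chunk in pd.read_hdf("data.h5", key="df", chunksize=100_000))

# Union the pieces into a single distributed DataFrame.
df = reduce(lambda left, right: left.union(right), chunks)

# Second option: write the data once to a splittable format so future
# jobs can read it in parallel without going through the driver.
df.write.mode("overwrite").parquet("hdfs:///data/df_parquet")
```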