I have a DataFrame with 10 million records. I need to do some operations on this data in pandas, but I do not have enough memory to hold all 10 million records in pandas at once. So I want to chunk it and call toPandas on each chunk.
df = sqlContext.sql("select * from db.table")
# do chunking to take X records at a time
# how do I generate chunked_df?
p_df = chunked_df.toPandas()
# do things to p_df
How do I chunk my DataFrame into either x equal parts or into parts by record count, say 1 million rows at a time? Either approach is acceptable; I just need to process it in smaller chunks.
A few splitting approaches that commonly come up:
DataFrame.limit() limits the result count to the number specified, so it can take the first 'n' rows of a Spark DataFrame as one slice.
pandas' iloc indexer slices a DataFrame into smaller DataFrames by row or column position, but it only applies once the data is already in pandas.
randomSplit() returns multiple slices of a Spark DataFrame according to a list of fractions you pass in; the rows are assigned to the slices randomly (a sketch follows below).
pyspark.sql.functions.split() is about strings rather than chunking: it converts a delimiter-separated string column (split on space, comma, pipe, etc.) into an ArrayType column.
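A minimal sketch of the randomSplit route, assuming the df from the question and roughly ten equal chunks (exact sizes vary because rows are assigned randomly, and each toPandas() call triggers its own Spark job):

weights = [1.0] * 10
for chunk in df.randomSplit(weights, seed=42):
    p_df = chunk.toPandas()  # only this slice is in pandas memory
    # do things to p_df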
One option is to use toLocalIterator in conjunction with repartition and mapPartitions.
import pandas as pd

columns = spark_df.schema.fieldNames()
chunks = spark_df.repartition(num_chunks).rdd \
    .mapPartitions(lambda iterator: [pd.DataFrame(list(iterator), columns=columns)]) \
    .toLocalIterator()

for pdf in chunks:
    # do work locally on chunk as pandas df
    ...
By using toLocalIterator, only one partition at a time is collected to the driver.
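A usage sketch tying this to the "1 million rows at a time" requirement from the question; deriving num_chunks from a count() is my assumption, and df is the DataFrame from the question:

import math
import pandas as pd

total_rows = df.count()
num_chunks = math.ceil(total_rows / 1_000_000)  # ~1 million rows per chunk

columns = df.schema.fieldNames()
chunks = df.repartition(num_chunks).rdd \
    .mapPartitions(lambda it: [pd.DataFrame(list(it), columns=columns)]) \
    .toLocalIterator()

for p_df in chunks:
    # only this chunk is held in driver memory as a pandas DataFrame
    print(len(p_df))  # placeholder for the real per-chunk work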
Another option, which in my opinion is preferable, is to distribute your work across the cluster on the pandas chunks in each partition. This can be achieved using pandas_udf:
from pyspark.sql.functions import spark_partition_id, pandas_udf, PandasUDFType

@pandas_udf(result_schema, PandasUDFType.GROUPED_MAP)
def transform_pandas_df_chunk(pdf):
    # do distributed work on a chunk of the original spark dataframe as a pandas dataframe
    result_pdf = ...
    return result_pdf

spark_df = spark_df.repartition(num_chunks).groupby(spark_partition_id()).apply(transform_pandas_df_chunk)
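The GROUPED_MAP pandas_udf style above is the Spark 2.x API; on Spark 3.x the same pattern is usually written with applyInPandas. A minimal sketch, assuming result_schema describes the columns the function returns:

from pyspark.sql.functions import spark_partition_id

def transform_pandas_df_chunk(pdf):
    # same per-chunk pandas logic as above; pdf arrives as a pandas DataFrame
    result_pdf = pdf  # placeholder
    return result_pdf

spark_df = spark_df.repartition(num_chunks) \
    .groupby(spark_partition_id()) \
    .applyInPandas(transform_pandas_df_chunk, schema=result_schema)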