I need to run a distributed calculation on a Spark DataFrame, invoking some arbitrary (non-SQL) logic on chunks of the DataFrame. I did:
def some_func(df_chunk):
    pan_df = df_chunk.toPandas()
    # whatever logic here

df = sqlContext.read.parquet(...)
result = df.mapPartitions(some_func)
Unfortunately, this leads to:
AttributeError: 'itertools.chain' object has no attribute 'toPandas'
I expected to get a Spark DataFrame object within each map invocation; instead I got an 'itertools.chain'. Why? And how do I overcome this?
mapPartitions() is similar to map(); the difference is that Spark calls your mapPartitions() function once per partition rather than once per row, which lets you do heavy initialization (for example, a database connection) once for each partition instead of for every row. Crucially, the function does not receive a DataFrame: each partition is handed over as a plain Python iterator over Row objects (under the hood often an itertools.chain), which is exactly why toPandas() fails in your code.
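To illustrate the once-per-partition pattern, here is a minimal sketch; connect_to_db() and enrich() are hypothetical placeholders for your own setup and per-row logic:

def per_partition(rows):
    conn = connect_to_db()       # heavy setup, done once per partition
    for row in rows:             # rows is a plain iterator of Row objects
        yield enrich(conn, row)  # per-row work reuses the connection
    conn.close()

result = df.rdd.mapPartitions(per_partition)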
MapPartitionsRDD is an RDD that applies the provided function f to every partition of the parent RDD. By default, it does not preserve partitioning: the last input parameter, preservesPartitioning, is false. If it is true, it retains the original RDD's partitioning.
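In PySpark you can set this flag directly on RDD.mapPartitions. A small sketch, assuming a key-preserving transformation (the names pairs and scale_values are illustrative):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)]).partitionBy(2)

def scale_values(rows):
    # keys are left untouched, so the existing partitioning stays valid
    return ((k, v * 10) for k, v in rows)

scaled = pairs.mapPartitions(scale_values, preservesPartitioning=True)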
Try this:
>>> import pandas as pd
>>> columns = df.columns
>>> df.rdd.mapPartitions(lambda rows: [pd.DataFrame(list(rows), columns=columns)])
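Here is a sketch of how this could plug into your original some_func, assuming your pandas logic returns a DataFrame with the same columns (process_partition and the pass-through body are illustrative):

import pandas as pd

def some_func(pan_df):
    # whatever pandas logic here (pass-through as a placeholder)
    return pan_df

columns = df.columns

def process_partition(rows):
    pan_df = pd.DataFrame(list(rows), columns=columns)
    # yield plain tuples so Spark can rebuild a DataFrame afterwards
    for record in some_func(pan_df).itertuples(index=False, name=None):
        yield record

result = df.rdd.mapPartitions(process_partition)
result_df = result.toDF(columns)  # back to a Spark DataFrame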