This is the simplest DataFrame I could think of. I'm using PySpark 1.6.1.
# one row of data
rows = [ (1, 2) ]
cols = [ "a", "b" ]
df = sqlContext.createDataFrame(rows, cols)
So the data frame completely fits in memory, has no references to any files and looks quite trivial to me.
Yet when I collect the data, Spark launches 2000 tasks:
df.collect()
during the collect, 2000 tasks are run:
[Stage 2:===================================================>(1985 + 15) / 2000]
and then the expected output:
[Row(a=1, b=2)]
Why is this happening? Shouldn't the DataFrame be completely in memory on the driver?
So I looked into the code a bit to try to figure out what was going on. It seems that sqlContext.createDataFrame makes no attempt to choose a reasonable number of partitions based on the data.
Why 2000 tasks?
Spark uses 2000 tasks because my data frame had 2000 partitions. (Even though it seems like clear nonsense to have more partitions than rows.)
This can be seen by:
>>> df.rdd.getNumPartitions()
2000
Why did the DataFrame have 2000 partitions?
This happens because sqlContext.createDataFrame winds up using the default number of partitions (2000 in my case), irrespective of how the data is organized or how many rows it has.
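The default can be checked directly (a quick sanity check; this assumes sqlContext is backed by a SparkContext named sc, and on my setup it matched the partition count above):
>>> sc.defaultParallelism
2000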
The code trail is as follows.
In sql/context.py, the sqlContext.createDataFrame function calls (in this example):
rdd, schema = self._createFromLocal(data, schema)
which in turn calls:
return self._sc.parallelize(data), schema
And the SparkContext.parallelize function is defined in context.py:
numSlices = int(numSlices) if numSlices is not None else self.defaultParallelism
No check is done on the number of rows, and it is not possible to specify the number of slices from sqlContext.createDataFrame.
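One workaround (a sketch, not something I have benchmarked on 1.6.1) is to call parallelize yourself with an explicit numSlices and hand the resulting RDD to createDataFrame; since the conversion only maps over the RDD, I would expect the partition count to carry over:
>>> rdd = sc.parallelize(rows, numSlices=1)
>>> sqlContext.createDataFrame(rdd, cols).rdd.getNumPartitions()
1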
How can I change how many partitions the DataFrame has?
Using DataFrame.coalesce.
>>> smdf = df.coalesce(1)
>>> smdf.rdd.getNumPartitions()
1
>>> smdf.explain()
== Physical Plan ==
Coalesce 1
+- Scan ExistingRDD[a#0L,b#1L]
>>> smdf.collect()
[Row(a=1, b=2)]
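For this case coalesce is enough, since it only narrows the existing partitions without a shuffle. If you instead need to increase the partition count (or force a redistribution), DataFrame.repartition is the shuffle-based alternative, roughly like this:
>>> df.repartition(4).rdd.getNumPartitions()
4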