Calling <code>collect()</code> on an RDD returns the entire dataset to the driver, which can cause an out-of-memory error, so it should be avoided. Does <code>collect()</code> behave the same way when called on a DataFrame? What about the <code>select()</code> method: does it also work the same way as <code>collect()</code> when called on a DataFrame?

Spark dataframe: collect () vs select ()

2 Answers

Actions vs Transformations

Collect (Action) - Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

spark-sql doc

select(*cols) (transformation) - Projects a set of expressions and returns a new DataFrame.

Parameters: cols – list of column names (string) or expressions (Column). If one of the column names is ‘*’, that column is expanded to include all columns in the current DataFrame.

    >>> df.select('*').collect()
    [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
    >>> df.select('name', 'age').collect()
    [Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)]
    >>> df.select(df.name, (df.age + 10).alias('age')).collect()
    [Row(name=u'Alice', age=12), Row(name=u'Bob', age=15)]

Executing the select(column-name1, column-name2, etc.) method on a DataFrame returns a new DataFrame that holds only the columns named in the select() call.

For example, assume df has several columns, including "name" and "value":

df2 = df.select("name","value")

df2 holds only two of df's columns ("name" and "value").

As the result of select, df2 stays distributed across the executors and is not brought to the driver, unlike the result of collect().

sql-programming-guide

    df.printSchema()
    # root
    # |-- age: long (nullable = true)
    # |-- name: string (nullable = true)

    # Select only the "name" column
    df.select("name").show()
    # +-------+
    # |   name|
    # +-------+
    # |Michael|
    # |   Andy|
    # | Justin|
    # +-------+

You can also run collect() on a DataFrame (spark docs):

    >>> l = [('Alice', 1)]
    >>> spark.createDataFrame(l).collect()
    [Row(_1=u'Alice', _2=1)]
    >>> spark.createDataFrame(l, ['name', 'age']).collect()
    [Row(name=u'Alice', age=1)]

spark docs

To print all elements on the driver, one can use the collect() method to first bring the RDD to the driver node thus: rdd.collect().foreach(println). This can cause the driver to run out of memory, though, because collect() fetches the entire RDD to a single machine; if you only need to print a few elements of the RDD, a safer approach is to use the take(): rdd.take(100).foreach(println).
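The memory concern in the quoted passage can be illustrated with a toy model in plain Python. This is only an analogy, not Spark code: the generator `rows`, the helpers `take` and `collect`, and the `counter` dictionary are hypothetical names introduced here, and the generator merely stands in for a large distributed dataset.

```python
from itertools import islice

# Counts how many rows the "driver" side has actually received.
counter = {"pulled": 0}

def rows(n):
    # Stand-in for a large dataset: yields rows one at a time.
    for i in range(n):
        counter["pulled"] += 1
        yield {"id": i}

def take(source, k):
    # Analogous to take(k): stop after the first k rows.
    return list(islice(source, k))

def collect(source):
    # Analogous to collect(): materialize every row in a single list.
    return list(source)

first_three = take(rows(1_000_000), 3)
pulled_by_take = counter["pulled"]

counter["pulled"] = 0
everything = collect(rows(1000))
pulled_by_collect = counter["pulled"]
```

In this sketch `take` touches only the rows it returns, while `collect` has to hold the whole dataset in one list, which is the situation that can exhaust driver memory.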

196

answered Oct 13 '22 22:10

Yaron

Calling select results in lazy evaluation. For example:

    val df1 = df.select("col1")
    val df2 = df1.filter("col1 == 3")

Both statements above only build a lazy execution path, which is executed when you call an action on that DataFrame, such as show or collect:

val df3 = df2.collect()

Use .explain at the end of your transformation chain to inspect its plan. More detailed information is available in Transformations and Actions.
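The difference between recording a transformation and running an action can be sketched in plain Python. This is a toy model under stated assumptions, not Spark's implementation: `LazyFrame`, its methods, and the predicate `col1_is_3` are hypothetical names used only for illustration.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame; NOT Spark's implementation."""

    def __init__(self, rows, steps=()):
        self._rows = rows      # source data: a list of dicts
        self._steps = steps    # recorded transformations, not yet run

    def select(self, *cols):
        # Transformation: record a projection and return a new LazyFrame.
        step = lambda rows: ({c: r[c] for c in cols} for r in rows)
        return LazyFrame(self._rows, self._steps + (step,))

    def filter(self, predicate):
        # Transformation: record a row filter and return a new LazyFrame.
        step = lambda rows: (r for r in rows if predicate(r))
        return LazyFrame(self._rows, self._steps + (step,))

    def collect(self):
        # Action: run every recorded step and materialize a list.
        rows = iter(self._rows)
        for step in self._steps:
            rows = step(rows)
        return list(rows)


calls = []

def col1_is_3(row):
    calls.append(row)  # record each time the predicate actually runs
    return row["col1"] == 3

df = LazyFrame([{"col1": 1, "col2": "a"}, {"col1": 3, "col2": "b"}])
df1 = df.select("col1")
df2 = df1.filter(col1_is_3)
calls_before_action = len(calls)  # nothing has been evaluated yet

result = df2.collect()            # the action triggers the whole chain
calls_after_action = len(calls)
```

In this model the predicate is never called while `df1` and `df2` are being built; it runs only once `collect()` is invoked, which mirrors the behaviour the answer describes.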

answered Oct 13 '22 22:10

elcomendante

Tags:

dataframe

apache-spark

apache-spark-sql

Mrinal
