
Select specific columns in a PySpark dataframe to improve performance

Working with Spark dataframes imported from Hive, sometimes I end up with several columns that I don't need. Supposing that I don't want to filter them with

df = sqlContext.sql('select cols from mytable')

and I'm importing the entire table with

df = sqlContext.table('mytable')

does a select and subsequent cache improve performance or decrease memory usage, like

df = df.select('col_1', 'col_2', 'col_3')
df.cache()
df.count()

or is it just a waste of time? I will do lots of operations and data manipulations on df, like avg, withColumn, etc.

Ivan asked Jun 21 '16


2 Answers

IMO it makes sense to filter them beforehand:

df = sqlContext.sql('select col_1, col_2, col_3 from mytable')

so you won't waste resources...

If you can't do it this way, then you can do it the way you did...
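
As a rough sketch of both options (assuming a Spark 1.x HiveContext and a hypothetical Hive table mytable with columns col_1, col_2, col_3), explain() lets you compare the physical plans:

# Sketch only -- table and column names are hypothetical
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName='column-pruning-sketch')
sqlContext = HiveContext(sc)

# Option 1: project in the query itself
df1 = sqlContext.sql('select col_1, col_2, col_3 from mytable')

# Option 2: load the whole table, then select the columns you need
df2 = sqlContext.table('mytable').select('col_1', 'col_2', 'col_3')

# Both plans should show a scan limited to the three selected columns
df1.explain()
df2.explain()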

MaxU - stop WAR against UA answered Oct 12 '22


It is certainly good practice, but it is rather unlikely to result in a performance boost unless you try to pass data through a Python RDD or do something similar. If certain columns are not required to compute the output, the optimizer should automatically infer the projections and push them as early as possible in the execution plan.
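
One way to check this yourself (reusing the sqlContext and the hypothetical mytable from the question): an aggregation that only touches col_1 should yield a physical plan whose scan reads just that column, even without an explicit select:

# Only col_1 is needed for the average, so the optimizer should
# prune the remaining columns -- inspect the printed plan
df = sqlContext.table('mytable')
df.groupBy().avg('col_1').explain()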

It is also worth noting that using df.count() after df.cache() is useless most of the time (if not always). In general, count is rewritten by the optimizer as

SELECT SUM(1) FROM table

so what is typically requested from the source is:

SELECT 1 FROM table

Long story short, there is nothing useful to cache here.
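
For the workload the question actually describes (many subsequent operations such as avg and withColumn), a sketch of where cache does pay off might look like this, under the same hypothetical setup as above; the cache is reused by several actions instead of being forced by a bare count:

# Sketch, assuming the sqlContext / 'mytable' setup from the question
from pyspark.sql import functions as F

df = sqlContext.table('mytable').select('col_1', 'col_2', 'col_3')
df.cache()  # materialized lazily, on the first action that needs the data

# Both actions below reuse the cached columns instead of re-reading Hive
df.groupBy().avg('col_1').show()
df.withColumn('col_1_doubled', F.col('col_1') * 2).show()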

zero323 answered Oct 12 '22