
Select specific columns in a PySpark dataframe to improve performance

Working with Spark dataframes imported from Hive, sometimes I end up with several columns that I don't need. Supposing that I don't want to filter them with

df = sqlContext.sql('select cols from mytable')

and I'm importing the entire table with

df = sqlContext.table('mytable')

does a select and subsequent cache improve performance or decrease memory usage, like

df = df.select('col_1', 'col_2', 'col_3')
df.cache()
df.count()

or is it just a waste of time? I will do lots of operations and data manipulations on df, like avg, withColumn, etc.

Ivan asked Jun 21 '16


2 Answers

IMO it makes sense to filter them beforehand:

df = sqlContext.sql('select col_1, col_2, col_3 from mytable')

so you won't waste resources...

If you can't do it this way, then you can do it the way you did...
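
As a rough sketch of both options (assuming a Spark 1.x HiveContext and a hypothetical Hive table mytable with columns col_1, col_2, col_3), explain() lets you compare the physical plans:

# Sketch only -- table and column names are hypothetical
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName='column-pruning-sketch')
sqlContext = HiveContext(sc)

# Option 1: project in the query itself
df1 = sqlContext.sql('select col_1, col_2, col_3 from mytable')

# Option 2: load the whole table, then select the columns you need
df2 = sqlContext.table('mytable').select('col_1', 'col_2', 'col_3')

# Both plans should show a scan limited to the three selected columns
df1.explain()
df2.explain()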

MaxU - stop WAR against UA answered Oct 12 '22


It is certainly good practice, but it is rather unlikely to result in a performance boost unless you try to pass data through a Python RDD or do something similar. If certain columns are not required to compute the output, the optimizer should automatically infer the projections and push them as early as possible in the execution plan.
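
One way to check this yourself (reusing the sqlContext and the hypothetical mytable from the question): an aggregation that only touches col_1 should yield a physical plan whose scan reads just that column, even without an explicit select:

# Only col_1 is needed for the average, so the optimizer should
# prune the remaining columns -- inspect the printed plan
df = sqlContext.table('mytable')
df.groupBy().avg('col_1').explain()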

It is also worth noting that using df.count() after df.cache() is useless most of the time (if not always). In general, count is rewritten by the optimizer as

SELECT SUM(1) FROM table

so what is typically requested from the source is:

SELECT 1 FROM table

Long story short, there is nothing useful to cache here.
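
For the workload the question actually describes (many subsequent operations such as avg and withColumn), a sketch of where cache does pay off might look like this, under the same hypothetical setup as above; the cache is reused by several actions instead of being forced by a bare count:

# Sketch, assuming the sqlContext / 'mytable' setup from the question
from pyspark.sql import functions as F

df = sqlContext.table('mytable').select('col_1', 'col_2', 'col_3')
df.cache()  # materialized lazily, on the first action that needs the data

# Both actions below reuse the cached columns instead of re-reading Hive
df.groupBy().avg('col_1').show()
df.withColumn('col_1_doubled', F.col('col_1') * 2).show()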

zero323 answered Oct 12 '22