Working with Spark dataframes imported from Hive, sometimes I end up with several columns that I don't need. Supposing that I don't want to filter them with
df = SqlContext.sql('select cols from mytable')
and I'm importing the entire table with
df = SqlContext.table(mytable)
does a select
and subsequent cache
improves performance/decrease memory usage, like
df = df.select('col_1', 'col_2', 'col_3')
df.cache()
df.count()
or is just waste of time? I will do lots of operations and data manipulations on df
, like avg
, withColumn
, etc.
You can select the single or multiple columns of the Spark DataFrame by passing the column names you wanted to select to the select() function. Since DataFrame is immutable, this creates a new DataFrame with a selected columns. show() function is used to show the DataFrame contents.
To SELECT particular columns using the select option in PySpark Data Frame. This selects the ID Column From the DATA FRAME. The same can be done by aliasing the Data Frame. Using the DataFrame.
Select Single & Multiple Columns From PySpark You can select the single or multiple columns of the DataFrame by passing the column names you wanted to select to the select() function.
IMO it makes sense to filter them beforehand:
df = SqlContext.sql('select col_1, col_2, col_3 from mytable')
so you won't waste resources...
If you can't do it this way, then you can do it as you did it...
It is certainly a good practice but it is rather unlikely to result in a performance boost unless you try to pass data through Python RDD or do something similar. If certain columns are not required to compute the output optimizer should automatically infer projections and push these as early as possible in the execution plan.
Also it is worth noting that using df.count()
after df.cache()
will be useless most of the time (if not always). In general count
is rewritten by the optimizer as
SELECT SUM(1) FROM table
so what is typically requested from the source is:
SELECT 1 FROM table
Long story short there is nothing useful to cache here.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With