I'm getting some data from a Hive table and loading it into a DataFrame:
df = sqlContext.table('mydb.mytable')
and I'm filtering out a few values that are not useful:
df = df[df.myfield != "BADVALUE"]
I want to do this on the DataFrame and not in the SELECT query, for code design reasons. I noticed that even after I filtered the DataFrame, the query and load from Hive seem to run every single time I operate on df later:
df.groupBy('myfield').mean()
This takes a very long time, exactly as long as if I hadn't filtered the DataFrame at all. Is there a way to do a deep copy of it to increase performance and reduce the memory footprint?
It sounds like you need to cache your DataFrame:
df.cache()
Spark is lazily evaluated: when you perform a transformation (such as filter), Spark doesn't actually do anything yet. Computation only happens when you perform an action (such as show or count), and Spark does not keep intermediate or final results around afterwards; it only keeps the lineage of steps needed to recreate each DataFrame. So if you perform multiple actions, the whole plan (reading the table, filtering out the bad values) is re-executed each time. To avoid that redundant work, cache the intermediate DataFrame in memory.
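Putting the pieces together (a minimal sketch based on the code in your question; the SparkSession setup is an assumption for Spark 2.x+, where it replaces sqlContext):

from pyspark.sql import SparkSession

# Assumed setup for Spark 2.x+; on older versions keep using sqlContext.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Transformations: nothing is read from Hive yet.
df = spark.table('mydb.mytable')
df = df[df.myfield != "BADVALUE"]

# cache() is also lazy: it only marks df to be kept in memory.
df.cache()

# First action: reads the table, applies the filter, and fills the cache.
df.count()

# Later actions reuse the cached, already-filtered data.
df.groupBy('myfield').mean().show()

# Free the memory when you no longer need df.
df.unpersist()

If the filtered data is too big to fit in memory, you can use df.persist(StorageLevel.MEMORY_AND_DISK) (StorageLevel is importable from pyspark) so Spark spills partitions to disk instead of recomputing them.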