
Deep copy a filtered PySpark dataframe from a Hive query

I'm loading some data from a Hive table into a dataframe:

df = sqlContext.table('mydb.mytable')

and I'm filtering a few values that are not useful:

df = df[df.myfield != "BADVALUE"]

I want to do this on the dataframe, not in the select query, for code design reasons. I noticed that even after I filter the dataframe, the query and load from Hive seem to run every single time I operate on df later:

df.groupBy('myfield').mean()

This takes a very long time, exactly as long as if I hadn't filtered the dataframe. Is there a way to do a deep copy of it to improve performance and reduce the memory footprint?

Asked Mar 13 '23 by Ivan


1 Answer

It sounds like you need to cache your dataframe:

df.cache()

Spark is lazily evaluated. When you perform transformations (such as filter), Spark does not actually do anything; computations don't occur until you perform an action (such as show, count, etc.). Spark also does not keep any intermediate (or final) results around. It only keeps the lineage of steps needed to recreate each dataframe. So if you perform multiple actions, the redundant steps (e.g. reading in the table and filtering out bad values) are repeated each time, and to avoid that you need to cache the intermediate dataframe in memory.
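
Here is a minimal sketch of how that fits together, reusing the table and column names from the question (mydb.mytable and myfield are just the placeholders from the original post):

df = sqlContext.table('mydb.mytable')
df = df[df.myfield != "BADVALUE"]

# Mark the filtered dataframe for caching; nothing is computed yet.
df.cache()

# The first action reads the Hive table, applies the filter, and
# materializes the result in memory.
df.count()

# Later actions reuse the cached data instead of re-reading the table.
df.groupBy('myfield').mean()

Note that cache() is itself lazy: the data is only materialized when the first action runs. If the filtered dataframe doesn't fit in memory, df.persist(StorageLevel.MEMORY_AND_DISK) is an alternative that spills to disk instead of recomputing.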

Answered Mar 15 '23 by David