I'm getting some data from a Hive table and loading it into a DataFrame:
df = sqlContext.table('mydb.mytable')
and I'm filtering out a few values that are not useful:
df = df[df.myfield != "BADVALUE"]
I want to do this on the DataFrame and not in the SELECT query, for code design reasons. I noticed that even after I filtered the DataFrame, the query and load from Hive seem to run every single time I operate on df later:
df.groupBy('myfield').mean()
This takes a very long time, exactly as long as if I hadn't filtered the DataFrame at all. Is there a way to do a deep copy of it to increase performance and reduce the memory footprint?
It sounds like you need to cache your DataFrame:
df.cache()
Spark is lazily evaluated: when you perform a transformation (such as filter), Spark doesn't actually do anything yet. Computation only happens when you perform an action (such as show or count), and Spark does not keep intermediate or final results around afterwards; it only keeps the lineage of steps needed to recreate each DataFrame. So if you perform multiple actions, the whole plan (reading the table, filtering out the bad values) is re-executed each time. To avoid that redundant work, cache the intermediate DataFrame in memory.
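Putting the pieces together (a minimal sketch based on the code in your question; the SparkSession setup is an assumption for Spark 2.x+, where it replaces sqlContext):

from pyspark.sql import SparkSession

# Assumed setup for Spark 2.x+; on older versions keep using sqlContext.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Transformations: nothing is read from Hive yet.
df = spark.table('mydb.mytable')
df = df[df.myfield != "BADVALUE"]

# cache() is also lazy: it only marks df to be kept in memory.
df.cache()

# First action: reads the table, applies the filter, and fills the cache.
df.count()

# Later actions reuse the cached, already-filtered data.
df.groupBy('myfield').mean().show()

# Free the memory when you no longer need df.
df.unpersist()

If the filtered data is too big to fit in memory, you can use df.persist(StorageLevel.MEMORY_AND_DISK) (StorageLevel is importable from pyspark) so Spark spills partitions to disk instead of recomputing them.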