I am working in Apache Spark, applying multiple transformations to a single DataFrame with Python. I have written some functions to make the different transformations easier. Imagine we have functions like:
def clearAccents(df, columns):
    # lines that remove accents from the DataFrame
    # with Spark functions or a UDF
    return df
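For concreteness, a minimal sketch of what such a function could look like; the UDF and the unicodedata-based normalization here are my assumptions, not necessarily how the original is written:

import unicodedata

from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

def clearAccents(df, columns):
    # Assumed implementation: decompose characters (NFD) and drop the
    # combining marks, applied to each requested column via a UDF.
    def strip_accents(text):
        if text is None:
            return None
        return "".join(c for c in unicodedata.normalize("NFD", text)
                       if unicodedata.category(c) != "Mn")

    strip_accents_udf = udf(strip_accents, StringType())
    for column in columns:
        df = df.withColumn(column, strip_accents_udf(col(column)))
    return df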
I use those functions to "overwrite" the DataFrame variable, saving the newly transformed DataFrame each time a function returns. I know this is not good practice, and now I am seeing the consequences.
I noticed that every time I add a line like the ones below, the running time gets longer:
# Step transformation 1:
df = function1(df, column)
# Step transformation 2:
df = function2(df, column)
As I understand it, Spark does not save the resulting DataFrame; it saves all the operations needed to produce the DataFrame at the current line. For example, when running function1, Spark runs only that function, but when running function2, Spark runs function1 and then function2. What if I really need to run each function only once?
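You can watch this lineage accumulate yourself. A small self-contained illustration (withColumn stands in for your own transformation functions):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.range(5).toDF("n")

# Each "transformation" only adds a step to the plan; nothing runs yet.
df = df.withColumn("n", col("n") + 1)  # step 1
df = df.withColumn("n", col("n") * 2)  # step 2

# explain() prints the accumulated plan: both steps appear, and every
# action on df re-executes the whole chain unless a result is cached.
df.explain()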
I tried df.cache() and df.persist(), but I didn't get the desired results. I want to save partial results so that it isn't necessary to compute all the instructions from the beginning, only from the last transformation function, without getting a StackOverflowError.
You probably aren't getting the desired results from cache() or persist() because they won't be evaluated until you invoke an action. You can try something like this:
# Step transformation 1:
df = function1(df, column).cache()
# Now invoke an action so the cached result is actually materialized
df.count()
# Step transformation 2:
df = function2(df, column)
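If the lineage still grows long enough to cause a StackOverflowError, cache() may not be enough on its own, since it keeps the data but not a truncated plan. Spark also offers checkpoint(), which writes the data out and cuts the lineage. A minimal sketch, assuming function1 and function2 are your own transformation functions and /tmp/spark-checkpoints is a writable path:

# checkpoint() needs a checkpoint directory configured first
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

# Step transformation 1, materialized with its lineage truncated:
df = function1(df, column).checkpoint()

# Step transformation 2 now builds only on the checkpointed result:
df = function2(df, column)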
To see how the execution graph changes, the SQL tab in the Spark UI is a particularly helpful debugging tool.
I'd also recommend checking out the ML Pipeline API to see whether it's worth implementing a custom Transformer. See Create a custom Transformer in PySpark ML.
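As a rough idea of what that involves, here is a minimal custom Transformer sketch; the UppercaseTransformer name and its upper-casing logic are placeholders for your own transformation:

from pyspark import keyword_only
from pyspark.ml import Pipeline, Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.sql.functions import col, upper

class UppercaseTransformer(Transformer, HasInputCol, HasOutputCol):
    # Placeholder transformer: copies inputCol to outputCol in upper case.
    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super(UppercaseTransformer, self).__init__()
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None):
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    def _transform(self, df):
        return df.withColumn(self.getOutputCol(), upper(col(self.getInputCol())))

# Each transformation then becomes a pipeline stage:
stage = UppercaseTransformer(inputCol="name", outputCol="name_upper")
result = Pipeline(stages=[stage]).fit(df).transform(df)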