
How can I save partial results of dataframe transformation processes in pyspark?

I am working in apache-spark to make multiple transformations on a single Dataframe with python.

I have coded some functions in order to make easier the different transformations. Imagine we have functions like:

def clearAccents(df, columns):
    # lines that remove accents from the dataframe with Spark
    # functions or a udf
    return df

I use those functions to "overwrite" the dataframe variable, saving the transformed dataframe each time a function returns. I know this is not good practice, and now I am seeing the consequences.

I noticed that every time I add a line like the one below, the running time gets longer:

# Step transformation 1:
df = function1(df, column)
# Step transformation 2:
df = function2(df, column)

As I understand it, Spark does not save the resulting dataframe; it saves all the operations needed to produce the dataframe at the current line. For example, when running function1 Spark runs only that function, but when running function2 Spark runs function1 and then function2. What if I really need to run each function only once?

I tried df.cache() and df.persist(), but I didn't get the desired results.

I want to save partial results so that it is not necessary to compute every instruction from the beginning, only from the last transformation function, without getting a StackOverflowError.

Hugo Reyes asked Nov 08 '22 15:11


1 Answer

You probably aren't getting the desired results from cache() or persist() because they won't be evaluated until you invoke an action. You can try something like this:

# Step transformation 1:
df = function1(df, column).cache()

# Now invoke an action
df.count()

# Step transformation 2:
df = function2(df, column)

The SQL tab in the Spark Job UI is a particularly helpful debugging tool for confirming that the execution graph actually changes once the cached result is in place.

I'd also recommend checking out the ML Pipeline API and seeing if it's worth implementing a custom Transformer. See Create a custom Transformer in PySpark ML.

Chris Dove answered Nov 14 '22 21:11