I have the code below that takes a dataset and runs a SQL transformation on it via a wrapper function that calls the Spark SQL API through sparklyr. I then use "invoke("createOrReplaceTempView", "name")" to register the result as a temporary view in the Spark session so that I can reference it in a later call. Next I use dplyr's "mutate" to call the Hive function "regexp_replace", replacing a leading capital letter with "0". I then need to call the SQL API again.
However, to do so I seem to have to use sparklyr's "copy_to" function, and on a large data set "copy_to" causes the following error:
Error: org.apache.spark.SparkException: Job aborted due to stage
failure: Total size of serialized results of 6 tasks (1024.6 MB) is
bigger than spark.driver.maxResultSize (1024.0 MB)
Is there an alternative to "copy_to" that will give me a Spark data frame I can then query through the SQL API?
Here is my code:
sqlfunction <- function(sc, block) {
  spark_session(sc) %>% invoke("sql", block)
}

# Run SQL and register the result as a temp view
sqlfunction(sc, "SELECT * FROM test") %>%
  invoke("createOrReplaceTempView", "name4")

# Replace a leading capital letter with "0" via Hive's regexp_replace
names <- tbl(sc, "name4") %>%
  mutate(col3 = regexp_replace(col2, "^[A-Z]", "0"))

# This is the step that fails on large data
copy_to(sc, names, overwrite = TRUE)

sqlfunction(sc, "SELECT * FROM test") %>%
  invoke("createOrReplaceTempView", "name5")

final <- tbl(sc, "name5")
It'd help if you had a reprex, but try registering the mutated table as a temporary view with "sdf_register" instead of round-tripping it through "copy_to". "copy_to" serializes the results back to the driver before re-uploading them, which is exactly what trips "spark.driver.maxResultSize" in your error; "sdf_register" just attaches a name to the existing lazy query, so nothing is collected:
# Register the lazy dplyr result as a temp view; no data moves to the driver
names %>% sdf_register("names")

final <- sqlfunction(sc, "SELECT * FROM names") %>%
  sdf_register("name5")