alternative to copy_to in sparklyr for large data sets

I have the code below that takes a dataset and applies a SQL transformation to it via a wrapper function that calls the Spark SQL API through sparklyr. I then use invoke("createOrReplaceTempView", "name") to register the result as a temporary view in the Spark session so that I can reference it in a later call. Next I use dplyr's mutate to call the Hive function regexp_replace, which replaces a leading letter with "0". After that I need to call the SQL API again.

However, to do this I seem to have to use sparklyr's copy_to function. On a large data set, copy_to fails with the following error:

Error: org.apache.spark.SparkException: Job aborted due to stage
failure: Total size of serialized results of 6 tasks (1024.6 MB) is
bigger than spark.driver.maxResultSize (1024.0 MB)
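(As a side note, the cap named in this error can also be raised through the Spark config at connection time. This is only a sketch; the "4G" value and local master are illustrative assumptions, not values from the question:)

```r
library(sparklyr)

# Raise the driver's result-size cap before connecting
# (the "4G" value and the local master are illustrative assumptions)
conf <- spark_config()
conf$spark.driver.maxResultSize <- "4G"
sc <- spark_connect(master = "local", config = conf)
```

Raising the limit only postpones the problem, though; avoiding the driver round-trip altogether is the better fix.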

Is there an alternative to copy_to that gives me a Spark data frame I can then query through the SQL API?

Here is my code:

    sqlfunction <- function(sc, block) {
      spark_session(sc) %>% invoke("sql", block)
    }

    sqlfunction(sc, "SELECT * FROM test") %>%
      invoke("createOrReplaceTempView", "name4")

    names <- tbl(sc, "name4") %>%
      mutate(col3 = regexp_replace(col2, "^[A-Z]", "0"))

    copy_to(sc, names, overwrite = TRUE)

    sqlfunction(sc, "SELECT * FROM test") %>%
      invoke("createOrReplaceTempView", "name5")

    final <- tbl(sc, "name5")
Asked Oct 28 '25 by Levi Brackman
1 Answer

It'd help if you had a reprex, but try:

    # sdf_register() names the lazily built result as a temp view,
    # so nothing is collected to the driver
    names %>% sdf_register("name5")

    final <- sqlfunction(sc, "SELECT * FROM name5") %>%
      sdf_register("final")
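Applied to the pipeline in the question, the same idea can be sketched end to end. This assumes the same sc connection, test table, and col2 column as above; the view name "name4_clean" is a made-up placeholder:

```r
# Register the transformed table as a temp view instead of copy_to();
# sdf_register() only names the lazy result, so no data crosses the driver
tbl(sc, "name4") %>%
  mutate(col3 = regexp_replace(col2, "^[A-Z]", "0")) %>%
  sdf_register("name4_clean")

# The view is now visible to the SQL API
final <- sqlfunction(sc, "SELECT * FROM name4_clean") %>%
  sdf_register("name5")
```

Because everything stays as Spark-side transformations, the spark.driver.maxResultSize limit never comes into play.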
Answered Oct 30 '25 by kevinykuo


