alternative to copy_to in sparklyr for large data sets

I have the code below that takes a dataset and applies a SQL transformation to it via a wrapper function that calls the Spark SQL API through sparklyr. I then use invoke("createOrReplaceTempView", "name") to register the result as a temporary view in the Spark session so that I can reference it in a later call. Next I use dplyr's mutate to call the Hive function regexp_replace, which replaces a leading letter with "0". After that I need to call the SQL API again.

However, to do this I seem to have to use sparklyr's copy_to function. On a large data set, copy_to fails with the following error:

Error: org.apache.spark.SparkException: Job aborted due to stage
failure: Total size of serialized results of 6 tasks (1024.6 MB) is
bigger than spark.driver.maxResultSize (1024.0 MB)
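(As a side note, the cap named in this error can also be raised through the Spark config at connection time. This is only a sketch; the "4G" value and local master are illustrative assumptions, not values from the question:)

```r
library(sparklyr)

# Raise the driver's result-size cap before connecting
# (the "4G" value and the local master are illustrative assumptions)
conf <- spark_config()
conf$spark.driver.maxResultSize <- "4G"
sc <- spark_connect(master = "local", config = conf)
```

Raising the limit only postpones the problem, though; avoiding the driver round-trip altogether is the better fix.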

Is there an alternative to copy_to that gives me a Spark data frame I can then query through the SQL API?

Here is my code:

    sqlfunction <- function(sc, block) {
      spark_session(sc) %>% invoke("sql", block)
    }

    sqlfunction(sc, "SELECT * FROM test") %>%
      invoke("createOrReplaceTempView", "name4")

    names <- tbl(sc, "name4") %>%
      mutate(col3 = regexp_replace(col2, "^[A-Z]", "0"))

    copy_to(sc, names, overwrite = TRUE)

    sqlfunction(sc, "SELECT * FROM test") %>%
      invoke("createOrReplaceTempView", "name5")

    final <- tbl(sc, "name5")
Asked Oct 28 '25 by Levi Brackman
1 Answer

It'd help if you had a reprex, but try:

    # sdf_register() names the lazily built result as a temp view,
    # so nothing is collected to the driver
    names %>% sdf_register("name5")

    final <- sqlfunction(sc, "SELECT * FROM name5") %>%
      sdf_register("final")
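Applied to the pipeline in the question, the same idea can be sketched end to end. This assumes the same sc connection, test table, and col2 column as above; the view name "name4_clean" is a made-up placeholder:

```r
# Register the transformed table as a temp view instead of copy_to();
# sdf_register() only names the lazy result, so no data crosses the driver
tbl(sc, "name4") %>%
  mutate(col3 = regexp_replace(col2, "^[A-Z]", "0")) %>%
  sdf_register("name4_clean")

# The view is now visible to the SQL API
final <- sqlfunction(sc, "SELECT * FROM name4_clean") %>%
  sdf_register("name5")
```

Because everything stays as Spark-side transformations, the spark.driver.maxResultSize limit never comes into play.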
Answered Oct 30 '25 by kevinykuo


