 

Sparklyr - Unable to copy data.frames into Spark using copy_to


I'm trying to copy a big data frame (around 5.8 million records) into Spark using sparklyr's copy_to function.

First, loading the data with fread (from data.table) and then applying copy_to produced the following error:

Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class "integer64" to a data.frame

So I converted the only two integer64 columns to character, then applied as.data.frame to the whole table (it was a data.table, since I had used fread).
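Roughly, the conversion looked like this (the file name and the column names id1/id2 are placeholders for the actual data):

    library(data.table)
    library(sparklyr)

    # "big_file.csv", "id1" and "id2" stand in for the real file
    # and the two integer64 columns.
    dt <- fread("big_file.csv")
    dt[, id1 := as.character(id1)]
    dt[, id2 := as.character(id2)]
    df <- as.data.frame(dt)

    # Alternatively, fread can read integer64 columns as character directly:
    # dt <- fread("big_file.csv", integer64 = "character")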

Using copy_to again, it ran for a long time (both before and after a progress bar appeared) and then returned:

Error in invoke_method.spark_shell_connection(sc, TRUE, class, method, : No status is returned. Spark R backend might have failed.

No data is copied into Spark.

Any thoughts?

asked Jul 05 '17 by Igor

1 Answer

I've run into this. Unfortunately, copying data frames from memory into Spark through sparklyr just isn't a good way to import larger data. It works much better to save the data frame to disk as a .csv and then have Spark read that file directly.
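A minimal sketch of that workflow, assuming a local Spark connection and placeholder paths (fwrite from data.table is just one way to write the file):

    library(sparklyr)

    sc <- spark_connect(master = "local")

    # Write the data frame to disk first ("/tmp/my_table.csv" is a placeholder) ...
    data.table::fwrite(df, "/tmp/my_table.csv")

    # ... then have Spark read the file itself, instead of serializing the
    # whole data frame through the R session and the backend connection.
    tbl <- spark_read_csv(sc, name = "my_table", path = "/tmp/my_table.csv")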

For peak performance, the best thing is to save the data in Parquet format on disk and read from that. Because Spark works using DAGs, giving it a more efficient on-disk format to operate on means your entire Spark job will be faster when you hit collect, insert, or whatever comes next.
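Continuing from the sketch above, the Parquet variant might look like this, assuming the arrow package is available to write the file (paths and the table name are again placeholders):

    # Write the data frame to Parquet on disk ("/tmp/my_table.parquet" is a placeholder).
    arrow::write_parquet(df, "/tmp/my_table.parquet")

    # Spark reads Parquet natively, preserving column types and benefiting
    # from its columnar, compressed layout.
    tbl <- spark_read_parquet(sc, name = "my_table", path = "/tmp/my_table.parquet")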

answered Oct 11 '22 by Zafar