I'm trying to copy a big dataframe (around 5.8 million records) into Spark using sparklyr's copy_to function.

First, when loading the data using fread (data.table) and applying copy_to, I got the following error:

Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class "integer64" to a data.frame
Then I converted the only two columns of type integer64 to character and applied as.data.frame to the whole object (it's a data.table, since I used fread).
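Roughly, the conversion step looked like this (a minimal sketch; the file name and the automatic column detection are illustrative, not my exact code):

```r
library(data.table)
library(bit64)  # fread stores large integers as bit64::integer64

dt <- fread("big_file.csv")  # ~5.8 million rows

# Find the integer64 columns and convert them to character
int64_cols <- names(dt)[sapply(dt, function(x) inherits(x, "integer64"))]
dt[, (int64_cols) := lapply(.SD, as.character), .SDcols = int64_cols]

# Coerce the data.table to a plain data.frame before copy_to
df <- as.data.frame(dt)
```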
Using copy_to again, it takes a long time both before and after a progress bar appears, and then the following error is returned:

Error in invoke_method.spark_shell_connection(sc, TRUE, class, method, : No status is returned. Spark R backend might have failed.
No data is copied into Spark.
Any thoughts?
I've run into this. Unfortunately, copying dataframes from memory into Spark via sparklyr's copy_to just isn't a good way to import larger data. It works much better end to end when I save my dataframe to disk as a .csv and then read that into Spark directly.
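Something along these lines has worked for me (a minimal sketch; the path, table name, and local Spark connection are placeholders, not part of your setup):

```r
library(sparklyr)
library(data.table)

sc <- spark_connect(master = "local")

# Write the data to disk instead of pushing it through the R <-> Spark bridge
fwrite(dt, "/tmp/mydata.csv")

# Let Spark read the file directly; this avoids serializing 5.8M rows via copy_to
sdf <- spark_read_csv(sc, name = "mydata", path = "/tmp/mydata.csv")
```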
For peak performance, the best option is to save the data to disk in Parquet format and read from that. Because Spark works using DAGs, a more efficient on-disk format for Spark to operate on makes the entire Spark job faster when you finally hit collect, insert, or whatever action you call.
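For example, one way to do this from R (a sketch that assumes the arrow package for writing Parquet locally, which isn't part of the original setup; the path is a placeholder):

```r
library(sparklyr)
library(arrow)

sc <- spark_connect(master = "local")

# Write the data frame to Parquet on disk (columnar, compressed, schema-aware)
arrow::write_parquet(df, "/tmp/mydata.parquet")

# Spark reads Parquet natively, so downstream operations in the DAG stay efficient
sdf <- spark_read_parquet(sc, name = "mydata", path = "/tmp/mydata.parquet")
```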