If I connect to a Spark cluster, copy some data to it, and disconnect, ...
library(dplyr)
library(sparklyr)
sc <- spark_connect("local")
copy_to(sc, iris)
src_tbls(sc)
## [1] "iris"
spark_disconnect(sc)
then the next time I connect to Spark, the data is not there.
sc <- spark_connect("local")
src_tbls(sc)
## character(0)
spark_disconnect(sc)
This is different from working with a database, where the data is just there regardless of how many times you connect and disconnect.
How do I persist data in the Spark cluster between connections?
I thought sdf_persist() might be what I want, but it appears not.
Spark is an engine that runs on a computer or cluster to execute tasks; it is not a database or a file system, so tables you copy into a session exist only for the lifetime of that connection. To keep data between connections, save it to a file system (or other storage) before you disconnect and load it back in during your next session.
https://en.wikipedia.org/wiki/Apache_Spark
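For example, one way to do this with sparklyr is to write the table out as Parquet before disconnecting and read it back in the next session. This is only a sketch: the /tmp/iris_parquet path is an illustrative placeholder, so substitute a location that suits your environment.
# End of the first session: write the Spark table to disk as Parquet
iris_tbl <- copy_to(sc, iris)
spark_write_parquet(iris_tbl, path = "/tmp/iris_parquet")  # example path
spark_disconnect(sc)

# Next session: read the Parquet data back into Spark
sc <- spark_connect("local")
iris_tbl <- spark_read_parquet(sc, name = "iris", path = "/tmp/iris_parquet")
src_tbls(sc)  # the table is registered again as "iris"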