I would like to remove a single data table from the Spark context (sc). I know a single cached table can be un-cached, but, as far as I can tell, that is not the same as removing an object from sc.
library(sparklyr)
library(dplyr)
library(titanic)
library(Lahman)
spark_install(version = "2.0.0")
sc <- spark_connect(master = "local")
batting_tbl <- copy_to(sc, Lahman::Batting, "batting")
titanic_tbl <- copy_to(sc, titanic_train, "titanic", overwrite = TRUE)
src_tbls(sc)
# [1] "batting" "titanic"
tbl_cache(sc, "batting") # Speeds up computations -- loaded into memory
src_tbls(sc)
# [1] "batting" "titanic"
tbl_uncache(sc, "batting")
src_tbls(sc)
# [1] "batting" "titanic"
To disconnect the entire sc I could use spark_disconnect(sc), but in this example that would destroy both the "titanic" and "batting" tables stored inside sc. Instead, I would like to delete just "batting" with something like spark_disconnect(sc, tableToRemove = "batting"), but that doesn't seem possible.
DROP TABLE deletes the table and, if the table is not an EXTERNAL table, removes the directory associated with it from the file system. If the table does not exist, it throws an exception. In the case of an external table, only the associated metadata is removed from the metastore database.
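If you prefer to issue the SQL yourself, sparklyr exposes a DBI interface for the connection, so something along these lines should work; this is a sketch, and passing sc straight to dbGetQuery is an assumption based on sparklyr's DBI support:

library(DBI)
# Drop the "batting" table registered with the Spark session;
# IF EXISTS avoids the exception when the table is absent
dbGetQuery(sc, "DROP TABLE IF EXISTS batting")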
sparklyr translates dplyr functions like arrange() into a SQL query plan that is executed by SparkSQL. This is not the case with SparkR, which has separate functions for SparkSQL tables and Spark DataFrames.
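You can inspect the SQL that gets generated with show_query(); a minimal sketch using the "batting" table from above:

# Render the SparkSQL that dplyr/sparklyr generates for a pipeline
batting_tbl %>%
  arrange(desc(HR)) %>%
  show_query()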
What is sparklyr? sparklyr is an open-source package that provides an interface between R and Apache Spark, letting you work with distributed data at low latency from a modern R environment.
dplyr::db_drop_table(sc, "batting")
I tried this function and it seems to work.
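To confirm the table is gone, re-listing the registered tables should now show only "titanic" (assuming the same session as above):

src_tbls(sc)
# [1] "titanic"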