 

Difference between collect and as.data.frame in SparkR

What is the difference between as.data.frame() and collect() when pulling a DataFrame object into local memory?

Asked Jul 07 '16 by PeterPancake

People also ask

What does collect() do in PySpark?

PySpark collect() – retrieve data from a DataFrame. collect() is an action on an RDD or DataFrame that retrieves its data: it gathers all the rows from every partition and brings them back to the driver node/program.

What is the difference between collect and show in Spark?

df.show() only displays a formatted preview of the DataFrame's contents on the driver, whereas df.collect() returns all of the rows to the driver program so you can work with the data locally.
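For a concrete sense of the difference, here is a minimal SparkR sketch (the question above is phrased for PySpark, but the same two operations exist in SparkR); it assumes a local Spark session and uses R's built-in faithful dataset:

library(SparkR)
sparkR.session(master = "local[*]")

df <- createDataFrame(faithful)  # distributed SparkDataFrame

showDF(df, numRows = 5)   # prints a formatted preview on the driver; nothing is returned to work with
local_df <- collect(df)   # brings every row back as an ordinary R data.frame
str(local_df)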

What can I use instead of Spark collect?

The collect action tries to move all of the data in the RDD/DataFrame to the driver machine, where it may run out of memory and crash. Instead, make sure the number of items returned is bounded by calling take or takeSample, or by filtering your RDD/DataFrame down first.
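A hedged SparkR sketch of those bounded alternatives, assuming df is a SparkDataFrame built from R's built-in faithful dataset (as in the example above):

first_rows <- take(df, 10)                          # only 10 rows reach the driver
small_df   <- collect(limit(df, 100))               # cap the row count before collecting
long_waits <- collect(filter(df, df$waiting > 80))  # shrink the data by filtering first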

What is the difference between SparkR and sparklyr?

sparklyr provides a range of functions that give you access to Spark's tools for transforming and pre-processing data through a dplyr-style interface. SparkR is the R frontend that ships with Spark itself: to use it, you import the package into your environment and run your R code against a Spark session.
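A side-by-side sketch of the two entry points (package and function names as of recent Spark and sparklyr releases; in practice you would load only one of the two in a session, since several function names collide):

# SparkR ships with Spark itself
library(SparkR)
sparkR.session(master = "local[*]")
sdf <- as.DataFrame(mtcars)
head(filter(sdf, sdf$mpg > 25))

# sparklyr is a CRAN package with a dplyr-style interface
library(sparklyr)
library(dplyr)
sc  <- spark_connect(master = "local")
tbl <- copy_to(sc, mtcars, overwrite = TRUE)
tbl %>% filter(mpg > 25) %>% collect()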


1 Answer

There is no difference whatsoever. Excluding argument validation, SparkR::as.data.frame is simply implemented as a single call to SparkR::collect:

setMethod("as.data.frame",
          signature(x = "DataFrame"),
          function(x, ...) {
             # Arguments validation      
          }
          collect(x)
        })
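A quick, hypothetical way to convince yourself of the equivalence (assumes a running SparkR session):

df <- createDataFrame(faithful)
identical(collect(df), as.data.frame(df))  # TRUE: both go through the same code path
                                           # and return the same local R data.frame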
Answered by zero323