I'm exploring SparkR to compute statistics like quantiles, means, and category frequencies (the source file is a CSV in Amazon S3).
I'm able to parse the CSV file and create a DataFrame.
However, I'm not able to use this Spark DataFrame with standard R functions like quantile(), mean(), etc.
For example, here is the R data frame 'test':
> test <- data.frame(x=c(26,21,20),y=c(34,29,28))
> quantile ( test$x )
0% 25% 50% 75% 100%
20.0 20.5 21.0 23.5 26.0
The data frame above produces the right result. However, the DataFrame created via read.df() doesn't work with the quantile() function:
> myDf <- read.df(sqlContext, "s3n://path/s3file.csv", source = "com.databricks.spark.csv")
> quantile ( myDf$column1 )
Warning messages:
1: In is.na(<S4 object of class "Column">) :
is.na() applied to non-(list or vector) of type 'S4'
2: In is.na(x) : is.na() applied to non-(list or vector) of type 'S4'
Error in x[order(x, na.last = na.last, decreasing = decreasing)] :
error in evaluating the argument 'i' in selecting a method for function '[': Error in x[!nas] : object of type 'S4' is not subsettable
My question is simple: is there any way to use SparkR's DataFrame with native R functions? Or, how can I convert a SparkR DataFrame into a vector?
Thanks in advance.
There is no way to apply native R functions directly to a SparkR DataFrame. The easiest approach is to make your DataFrame local with
localDf <- collect(myDf)
On this local data.frame you can apply native R functions, though no longer in a distributed way. If you alter localDf to localDf2 with native R functions, you can convert it back into a SparkR DataFrame with
myDf2 <- createDataFrame(sqlContext, localDf2)
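Putting the steps together, here is a minimal sketch of the round trip, assuming a running SparkR session with a `sqlContext` and a numeric column named `column1` (the transformation applied to `localDf2` is purely illustrative):

```r
# Pull the distributed DataFrame into a local R data.frame.
# Note: collect() materializes ALL rows on the driver, so this is
# only safe when the data fits in the driver's memory.
localDf <- collect(myDf)

# Native R functions now work as usual on the local data.frame.
quantile(localDf$column1)
mean(localDf$column1)

# After modifying the local copy with ordinary R code...
localDf2 <- transform(localDf, column1 = column1 * 2)  # hypothetical change

# ...push it back to Spark as a distributed DataFrame.
myDf2 <- createDataFrame(sqlContext, localDf2)
```

For large datasets, consider filtering or sampling (e.g. with SparkR's filter() or sample()) before calling collect(), so only a manageable subset is brought to the driver.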