I'm exploring SparkR to compute statistics like quantiles, means, and category frequencies (the source file is a CSV in Amazon S3).
I'm able to parse the CSV file and create a DataFrame.
However, I'm not able to use this Spark DataFrame with standard R functions like quantile(), mean(), etc.
For example, here is the R data frame 'test':
> test <- data.frame(x=c(26,21,20),y=c(34,29,28))
> quantile ( test$x )
0% 25% 50% 75% 100%
20.0 20.5 21.0 23.5 26.0
The data frame above produces the right result. However, the DataFrame created via read.df() doesn't work with the quantile() function:
> myDf <- read.df(sqlContext, "s3n://path/s3file.csv", source = "com.databricks.spark.csv")
> quantile ( myDf$column1 )
Warning messages:
1: In is.na(<S4 object of class "Column">) :
is.na() applied to non-(list or vector) of type 'S4'
2: In is.na(x) : is.na() applied to non-(list or vector) of type 'S4'
Error in x[order(x, na.last = na.last, decreasing = decreasing)] :
error in evaluating the argument 'i' in selecting a method for function '[': Error in x[!nas] : object of type 'S4' is not subsettable
My question is simple: is there any way to use SparkR's DataFrame with native R functions? Or, how can I convert a SparkR DataFrame into a vector?
Thanks in advance.
There is no way to apply native R functions directly to a SparkR DataFrame. The easiest approach is to make your DataFrame local with
localDf <- collect(myDf)
On this local data.frame you can apply native R functions, though no longer in a distributed way. If you alter localDf to localDf2 with native R functions, you can convert it back into a SparkR DataFrame with
myDf2 <- createDataFrame(sqlContext, localDf2)
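Putting the steps together, here is a minimal sketch of the round trip, assuming a running SparkR session with a `sqlContext` and a numeric column named `column1` (the transformation applied to `localDf2` is purely illustrative):

```r
# Pull the distributed DataFrame into a local R data.frame.
# Note: collect() materializes ALL rows on the driver, so this is
# only safe when the data fits in the driver's memory.
localDf <- collect(myDf)

# Native R functions now work as usual on the local data.frame.
quantile(localDf$column1)
mean(localDf$column1)

# After modifying the local copy with ordinary R code...
localDf2 <- transform(localDf, column1 = column1 * 2)  # hypothetical change

# ...push it back to Spark as a distributed DataFrame.
myDf2 <- createDataFrame(sqlContext, localDf2)
```

For large datasets, consider filtering or sampling (e.g. with SparkR's filter() or sample()) before calling collect(), so only a manageable subset is brought to the driver.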