I'm new to Spark, SparkR and generally all HDFS-related technologies. I recently installed Spark 1.5.0 and ran some simple code with SparkR:
Sys.setenv(SPARK_HOME="/private/tmp/spark-1.5.0-bin-hadoop2.6")
.libPaths("/private/tmp/spark-1.5.0-bin-hadoop2.6/R/lib")
require('SparkR')
require('data.table')
sc <- sparkR.init(master="local")
sqlContext <- sparkRSQL.init(sc)
hiveContext <- sparkRHive.init(sc)
n = 1000
x = data.table(id = 1:n, val = rnorm(n))
Sys.time()
xs <- createDataFrame(sqlContext, x)
Sys.time()
The code executes almost immediately. However, when I change it to n = 1000000, it takes about 4 minutes (the time between the two Sys.time() calls). When I check these jobs in the console on port :4040, the job for n = 1000 has a duration of 0.2 s, and the job for n = 1000000 of 0.3 s. Am I doing something wrong?
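(As a side note, base R's system.time() captures the same elapsed time in a single call; a minimal sketch of the same measurement, assuming the setup above:)
# Wall-clock seconds spent in createDataFrame, equivalent to diffing the two Sys.time() calls
system.time(xs <- createDataFrame(sqlContext, x))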
You're not doing anything particularly wrong. It is just the effect of a combination of different factors:
createDataFrame, as it is currently (Spark 1.5.1) implemented, is slow. It is a known issue described in SPARK-8277.
The current implementation does not play well with data.table.
Until SPARK-8277 is resolved there is not much you can do, but there are two options you can try:
Use a plain old data.frame instead of data.table. Using the flights dataset (227496 rows, 14 columns):
df <- read.csv("flights.csv")
microbenchmark::microbenchmark(createDataFrame(sqlContext, df), times=3)
## Unit: seconds
## expr min lq mean median
## createDataFrame(sqlContext, df) 96.41565 97.19515 99.08441 97.97465
## uq max neval
## 100.4188 102.8629 3
compared to data.table:
dt <- data.table::fread("flights.csv")
microbenchmark::microbenchmark(createDataFrame(sqlContext, dt), times=3)
## Unit: seconds
## expr min lq mean median
## createDataFrame(sqlContext, dt) 378.8534 379.4482 381.2061 380.043
## uq max neval
## 382.3825 384.722 3
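Applied to the question's example, a minimal sketch of this first workaround (assuming the x and sqlContext defined above) is just a conversion before the call:
# Convert the data.table to a plain data.frame before handing it to createDataFrame
x_df <- as.data.frame(x)
xs <- createDataFrame(sqlContext, x_df)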
Write to disk and use spark-csv to load the data directly into a Spark DataFrame, without direct interaction with R. As crazy as it sounds:
dt <- data.table::fread("flights.csv")

write_and_read <- function() {
  # Dump the R object to a temporary CSV, then let spark-csv read it back
  path <- tempfile(fileext = ".csv")
  write.csv(dt, path, row.names = FALSE)
  read.df(sqlContext, path,
          source = "com.databricks.spark.csv",
          header = "true",
          inferSchema = "true")
}

microbenchmark::microbenchmark(write_and_read(), times = 3)
## Unit: seconds
## expr min lq mean median
## write_and_read() 2.924142 2.959085 2.983008 2.994027
## uq max neval
## 3.01244 3.030854 3
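For the question's data, the same pattern would look roughly like the sketch below (assuming spark-csv is already on the Spark classpath, e.g. pulled in with the --packages option at launch time; the path is a temporary file):
# Round-trip the question's data.table through a temporary CSV
# and let spark-csv infer the schema on the Spark side
path <- tempfile(fileext = ".csv")
write.csv(x, path, row.names = FALSE)
xs <- read.df(sqlContext, path,
              source = "com.databricks.spark.csv",
              header = "true",
              inferSchema = "true")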
I am not really sure whether it makes sense to push data that can be handled in R to Spark in the first place, but let's not dwell on that.
Edit:
This issue should be resolved by SPARK-11086 in Spark 1.6.0.