I have a 500K-row Spark DataFrame that lives in a parquet file. I'm using Spark 2.0.0 and the SparkR
package from within RStudio (R 3.3.1), all running on a local machine with 4 cores and 8 GB of RAM.
To build a dataset I can work on in R, I use the collect()
method to bring the Spark DataFrame into R. Doing so takes about 3 minutes, which is far longer than it takes to read an equivalently sized CSV file with the data.table
package.
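For concreteness, the workflow I mean looks roughly like this (a minimal sketch; the path and memory setting are placeholders, not my exact script):

    library(SparkR)

    # Start a local SparkR session (Spark 2.0.0, 4 cores)
    sparkR.session(master = "local[4]",
                   sparkConfig = list(spark.driver.memory = "6g"))

    # Read the parquet file as a Spark DataFrame (path is a placeholder)
    sdf <- read.parquet("/path/to/data.parquet")

    # Bring all ~500K rows back into a local R data.frame -- this is the slow step
    local_df <- collect(sdf)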
Admittedly, the parquet file is compressed and decompression time could be part of the issue, but I've found other comments online about the collect method being particularly slow, and little in the way of explanation.
I've tried the same operation in sparklyr, and it's much faster. Unfortunately, sparklyr doesn't let me do date manipulation inside joins and filters as easily as SparkR does, so I'm stuck with SparkR. I also don't believe I can use both packages at the same time (i.e. run queries with SparkR calls and then access those Spark objects with sparklyr).
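For comparison, the sparklyr version of the same pull is roughly this (again a sketch; the path and table name are placeholders):

    library(sparklyr)
    library(dplyr)

    # Connect to a local Spark instance
    sc <- spark_connect(master = "local")

    # Register the parquet file as a Spark table (path is a placeholder)
    tbl <- spark_read_parquet(sc, name = "mydata", path = "/path/to/data.parquet")

    # Pull the rows into a local R data frame
    local_df <- collect(tbl)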
Does anyone have a similar experience, an explanation for the relative slowness of SparkR's collect() method, and/or any solutions?
For context: collect() is the action on an RDD or DataFrame that retrieves the data. It gathers all the rows from every partition and brings them back to the driver program.
The collect action tries to move all the data in the RDD/DataFrame to the driver machine, which may run out of memory and crash. Instead, you can limit the number of items returned by calling take or takeSample, or by filtering your RDD/DataFrame first, as sketched below.
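In SparkR, that could look something like the following (a sketch; df and the year column are placeholders):

    library(SparkR)

    # Pull only the first 1,000 rows to the driver instead of the whole DataFrame
    preview <- take(df, 1000)

    # Or collect a random 1% sample (SparkR's sample() plays the role of takeSample here)
    sampled <- collect(sample(df, withReplacement = FALSE, fraction = 0.01))

    # Or cut the data down with a filter before collecting (column name is a placeholder)
    recent <- collect(filter(df, df$year >= 2016))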
To reduce full GC pauses under the G1 garbage collector, a commonly used approach is to decrease the InitiatingHeapOccupancyPercent value (the default is 45) so that G1 starts its initial concurrent marking earlier, which gives it a better chance of avoiding a full GC.
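From SparkR, one way to pass that kind of JVM flag is through sparkConfig when the session is created, since in local mode the driver JVM does all the work. This is a hedged sketch; the specific values are illustrative, not recommendations:

    library(SparkR)

    # Launch the session with the G1 collector and an earlier concurrent-marking trigger
    sparkR.session(
      master = "local[4]",
      sparkConfig = list(
        spark.driver.memory = "6g",
        spark.driver.extraJavaOptions =
          "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35"
      )
    )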
@Will
I don't know whether the following actually answers your question, but Spark is lazy: the transformations you run in Spark (or SparkR) don't create any data, they only build up a logical plan to follow.
When you run an action like collect, Spark has to fetch the data all the way from the source RDDs (assuming you haven't cached or persisted it).
If your data isn't large and can easily be handled by local R, there's no need to go through SparkR at all. Another option is to cache your data if you use it frequently, as in the sketch below.
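For example, in SparkR you could cache the DataFrame after the expensive transformations so that repeated actions don't recompute the whole plan (a sketch; df stands in for your transformed DataFrame):

    library(SparkR)

    # Keep the transformed DataFrame in memory across actions (default MEMORY_ONLY storage)
    cache(df)

    # The first action materializes and caches the partitions...
    count(df)

    # ...and later actions such as collect reuse them instead of re-reading the parquet file
    local_df <- collect(df)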