I'm learning the dplyr package in R, but I've hit a wall in understanding what three functions (compute, collect, and collapse) do.
I understand that dplyr doesn't use the data.frame type internally; instead it stores its data in its own type, tbl or tbl_df.
Then, to convert the custom type back to R's default data.frame so that the standard data.frame functions can be used on it, you must use collect, for example:
batting <- tbl(lahman_sqlite(), "Batting")
dim(collect(batting))
This returns [1] 99846 22 as of 2016, while dim(batting) returns [1] NA 22.
However, I'm not sure what the other two functions, compute and collapse, do. The documentation at ?collect says the following:
Description:
‘compute’ forces computation of lazy tbls, leaving data in the remote source. ‘collect’ also forces computation, but will bring data back into an R data.frame (stored in a ‘tbl_df’). ‘collapse’ doesn't force computation, but collapses a complex tbl into a form that additional restrictions can be placed on.
What does this mean, specifically "forces computation of lazy tbls"?
I would like to know what these functions do, and would like to get a clarification of what one does and the others don't.
From one of the dplyr vignettes:
There are three ways to force the computation of a query:
collect() executes the query and returns the results to R.
compute() executes the query and stores the results in a temporary table in the database.
collapse() turns the query into a table expression.
collect() is the function you'll use most. Once you reach the set of operations you want, you use collect() to pull the data into a local tbl_df(). If you know SQL, you can use compute() and collapse() to optimise performance.
If that's not helpful, your best bet is probably studying the source code of each function. You can see instructions on how to do that here: How do I see the help for the `dplyr::collect` method?
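As a rough sketch of how to get at that source code from the console: the three functions are S3 generics, so the actual work happens in class-specific methods. The example below assumes the database-backed methods are registered by the dbplyr package for the tbl_sql class, which is the case in the versions I am aware of but may differ in yours. The variable names are illustrative only.

```r
library(dplyr)
library(dbplyr)  # assumption: loading dbplyr registers the tbl_sql methods

# List every method currently registered for the collect generic.
methods("collect")

# Fetch the database-backed methods as ordinary function objects.
collect_sql  <- getS3method("collect", "tbl_sql")
compute_sql  <- getS3method("compute", "tbl_sql")
collapse_sql <- getS3method("collapse", "tbl_sql")

# Printing a function shows its source.
print(collect_sql)
```

If getS3method() reports that a method is missing, methods("collect") shows which classes do have one in your installed versions.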