Simple command for extracting column names in sparklyr (R+spark)

In base R, it is easy to extract the names of columns (variables) from a data frame:

> testdf <- data.frame(a1 = rnorm(1e5), a2 = rnorm(1e5), a3 = rnorm(1e5), a4 = rnorm(1e5), a5 = rnorm(1e5), a6 = rnorm(1e5))  
> names(testdf)  
[1] "a1" "a2" "a3" "a4" "a5" "a6"

But with sparklyr, things become more complicated. After copying the data frame to Spark,

> testdf_tbl <- copy_to(sc, testdf, overwrite = TRUE)  
> names(testdf_tbl)  
[1] "src" "ops"

the variable names actually reside deep inside 'ops'

> testdf_tbl$ops$vars  
[1] "a1" "a2" "a3" "a4" "a5" "a6"

If this were all, there would be no problem (and no need to ask this question). But every time an operation is applied to testdf_tbl, the column/variable names move to a different position in the object, as shown below.

> testdf_tbl <- testdf_tbl %>% select(-a1)  
> testdf_tbl$ops$vars  
NULL  
> testdf_tbl$ops$x$vars  
[1] "a1" "a2" "a3" "a4" "a5" "a6"  

Another operation adds another $x to the path, and so on.

> testdf_tbl <- testdf_tbl %>% select(-a2)  
> testdf_tbl$ops$x$vars  
NULL  
> testdf_tbl$ops$x$x$vars  
[1] "a1" "a2" "a3" "a4" "a5" "a6"  

To make matters worse, the list of variables does not reflect the select operations we have made; it still lists a1 and a2 as column names, whereas

> head(testdf_tbl)  
Source:   query [?? x 4]  
Database: spark connection master=local[24] app=sparklyr local=TRUE  
        a3           a4          a5         a6  
        dbl          dbl         dbl        dbl  
1 -1.146368875  1.691698406  0.43231629  1.3349111  
2  0.664928710 -1.332242020  0.05380729  1.0139253  
3  1.158095695 -0.097098980 -0.61885204  0.1504693  
4  0.001595841 -0.003765908  0.27935192 -0.3039085  
5 -0.133446040  0.269329076  1.57210274  1.7762602  
6  0.006468698 -1.300439537  0.74057307  0.1320428  

So clearly, the select operations have had an effect in terms of how the Spark data frame is used.

SURELY there is a simple, straightforward way to extract the current names of variables/columns in sparklyr, à la names() in base R?

Asked by Prasanna, Oct 11 '16 13:10


1 Answer

As Kevin said, tbl_vars() works, but if you want something more "base-R"-like, colnames() also does it.
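A minimal sketch of both calls on the table from the question, for illustration only. It assumes a local Spark installation is available and that connecting with master = "local" works on your machine; the smaller row count and the names current_cols and current_vars are my own choices, not from the question.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally (e.g. via spark_install()).
sc <- spark_connect(master = "local")

# Same shape as the data frame in the question, with fewer rows.
testdf <- data.frame(a1 = rnorm(10), a2 = rnorm(10), a3 = rnorm(10),
                     a4 = rnorm(10), a5 = rnorm(10), a6 = rnorm(10))
testdf_tbl <- copy_to(sc, testdf, overwrite = TRUE)

# Drop two columns, as in the question.
testdf_tbl <- testdf_tbl %>% select(-a1, -a2)

# Both calls report the columns of the current (lazy) table,
# however many operations have been stacked on it.
current_cols <- colnames(testdf_tbl)
current_vars <- as.character(tbl_vars(testdf_tbl))

print(current_cols)
print(current_vars)

spark_disconnect(sc)
```

Either way, there is no need to dig through testdf_tbl$ops by hand, since that internal structure changes with every operation.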

Answered by Josh Magarick, Oct 02 '22 11:10