
Use of first, last, nth in sparklyr

I have looked all over and still can't get the dplyr functions first(), last(), and nth() to work within sparklyr. A reproducible example is below. First, some session info:

R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server 7.4 (Maipo)

I am running dplyr 0.7.4, sparklyr 0.8.3, and Spark 2.2.0.

Here is the (desired) result of running dplyr code outside of sparklyr:

set.seed(999)

df <- data.frame(group = letters[rep(1:4, each = 2)],
                 class = letters[rep(1:4, times = 2)],
                 value = rnorm(8), stringsAsFactors = FALSE)

> df
  group class      value
1     a     a -0.2817402
2     a     b -1.3125596
3     b     c  0.7951840
4     b     d  0.2700705
5     c     a -0.2773064
6     c     b -0.5660237
7     d     c -1.8786559
8     d     d -1.2667911

df %>% 
  group_by(group) %>% 
  summarize(value = sum(value),
            class = first(class))

# A tibble: 4 x 3
  group  value class
  <chr>  <dbl> <chr>
1 a     -1.59  a    
2 b      1.07  c    
3 c     -0.843 a    
4 d     -3.15  c 
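
The same goes for last() and nth(). As a minimal local sketch (plain dplyr, no Spark), this is the behaviour I would like to reproduce in sparklyr. The names df_local and res_local and the *_class column names are made up for illustration, and the value column is left out because only class matters here:

```r
library(dplyr)

# Same group/class layout as the data.frame above (hypothetical copy
# without the random value column).
df_local <- data.frame(group = rep(letters[1:4], each = 2),
                       class = rep(letters[1:4], times = 2),
                       stringsAsFactors = FALSE)

# Take the first, last, and second class within each group.
res_local <- df_local %>%
  group_by(group) %>%
  summarize(first_class  = first(class),
            last_class   = last(class),
            second_class = nth(class, 2))

print(res_local)
```

Each group has exactly two rows here, so last(class) and nth(class, 2) return the same element.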

However, when I copy over that data.frame to spark, the result is not what I expect:

df <- sdf_copy_to(sc, df, "df", memory = FALSE, overwrite = TRUE)

df %>% 
  group_by(group) %>% 
  summarize(value = sum(value),
            class = first(class))

# Source:   lazy query [?? x 3]
# Database: spark_connection
  group  value class  
  <chr>  <dbl> <chr>  
1 d     -3.15  `class`
2 c     -0.843 `class`
3 b      1.07  `class`
4 a     -1.59  `class`

I also checked whether this was a namespace issue, but qualifying the call with dplyr:: did not solve the problem and produced an error instead:

df %>% 
  group_by(group) %>% 
  summarize(value = sum(value),
            class = dplyr::first(class))

Error in x[[n]] : object of type 'builtin' is not subsettable

In my real (non-reproducible) code I also sometimes got the following error, depending on how I changed the code, but I haven't been able to reproduce it with this example:

Error in nth(x, -1L, order_by = order_by, default = default) : 
  object 'class' not found

Any help (including alternatives) would be greatly appreciated!

Hutch3232 asked Aug 30 '25 16:08


1 Answer

I had the same problem. Using the Spark SQL function first_value() in place of first() should work:

df %>% 
  group_by(group) %>% 
  summarize(value = sum(value),
            class = first_value(class))

It works well with both character and numeric columns.

By the way, I'm using dplyr 0.8.0.1 and sparklyr 0.9.4.
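
As far as I understand it, this works because first_value() is not a dplyr function: the SQL translation layer (dbplyr) leaves calls it does not recognise as they are, so Spark SQL resolves the function itself. Spark SQL also has last_value(), which may help with the last() case. The sketch below only inspects that translation and assumes the dbplyr package is installed; simulate_dbi() is a stand-in connection rather than a Spark connection, and sql_first / sql_last are illustrative names:

```r
library(dbplyr)

# Stand-in connection, NOT Spark: used only so the generated SQL can be
# inspected without a running cluster.
con <- simulate_dbi()

# Neither function is known to the translator, so each call should be
# emitted unchanged for the SQL engine to resolve.
sql_first <- translate_sql(first_value(class), con = con)
sql_last  <- translate_sql(last_value(class),  con = con)

print(sql_first)
print(sql_last)
```

In a live sparklyr session, appending show_query() to the pipeline prints the SQL that is actually sent to Spark. Note that Spark does not guarantee row order within a group after a shuffle, so the value picked by first_value() may differ from what local dplyr returns unless the data is explicitly ordered.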

Ayar Paco answered Sep 02 '25 06:09