Complete dataframe in sparklyr

Question

I am trying to replicate the tidyr:complete function in sparklyr. I have a dataframe with some missing values and I have to fill out those rows. In dplyr/tidyr I can do:

data <- tibble(
  "id" = c(1,1,2,2),
  "dates" = c("2020-01-01", "2020-01-03", "2020-01-01", "2020-01-03"),
  "values" = c(3,4,7,8))

# A tibble: 4 x 3
     id dates      values
  <dbl> <chr>       <dbl>
1     1 2020-01-01      3
2     1 2020-01-03      4
3     2 2020-01-01      7
4     2 2020-01-03      8

data %>% 
  mutate(dates = as_date(dates)) %>% 
  group_by(id) %>% 
  complete(dates = seq.Date(min(dates), max(dates), by="day"))

# A tibble: 6 x 3
# Groups:   id [2]
     id dates      values
  <dbl> <date>      <dbl>
1     1 2020-01-01      3
2     1 2020-01-02     NA
3     1 2020-01-03      4
4     2 2020-01-01      7
5     2 2020-01-02     NA
6     2 2020-01-03      8

However the complete function does not exist in sparklyr.

data_spark %>% 
  mutate(dates = as_date(dates)) %>% 
  group_by(id) %>% 
  complete(dates = seq.Date(min(dates), max(dates), by="day"))

Error in UseMethod("complete_") : 
no applicable method for 'complete_' applied to an object of class "c('tbl_spark', 'tbl_sql', 'tbl_lazy', 'tbl')"

Is there a way to set a UDF or to achieve a similar result?

Thank you

bcarlsen · Accepted Answer

Under the hood tidyr::complete just performs a full join followed by optional NA fill. You can replicate its effects by using sdf_copy_to to create a new sdf that is just a single column seq.Date between your start and end date, and then perform a full_join between that and your dataset.

Complete dataframe in sparklyr

Tags:

r

dplyr

apache-spark

tidyr

sparklyr

Marco De Virgilis

1 Answers

bcarlsen

Recent Activity

Donate For Us

Complete dataframe in sparklyr

Tags:

r

dplyr

apache-spark

tidyr

sparklyr

Marco De Virgilis

1 Answers

bcarlsen

Related questions

Recent Activity

Donate For Us