Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Complete dataframe in sparklyr

I am trying to replicate the tidyr:complete function in sparklyr. I have a dataframe with some missing values and I have to fill out those rows. In dplyr/tidyr I can do:

data <- tibble(
  "id" = c(1,1,2,2),
  "dates" = c("2020-01-01", "2020-01-03", "2020-01-01", "2020-01-03"),
  "values" = c(3,4,7,8))

# A tibble: 4 x 3
     id dates      values
  <dbl> <chr>       <dbl>
1     1 2020-01-01      3
2     1 2020-01-03      4
3     2 2020-01-01      7
4     2 2020-01-03      8

data %>% 
  mutate(dates = as_date(dates)) %>% 
  group_by(id) %>% 
  complete(dates = seq.Date(min(dates), max(dates), by="day"))

# A tibble: 6 x 3
# Groups:   id [2]
     id dates      values
  <dbl> <date>      <dbl>
1     1 2020-01-01      3
2     1 2020-01-02     NA
3     1 2020-01-03      4
4     2 2020-01-01      7
5     2 2020-01-02     NA
6     2 2020-01-03      8

However the complete function does not exist in sparklyr.

data_spark %>% 
  mutate(dates = as_date(dates)) %>% 
  group_by(id) %>% 
  complete(dates = seq.Date(min(dates), max(dates), by="day"))

Error in UseMethod("complete_") : 
no applicable method for 'complete_' applied to an object of class "c('tbl_spark', 'tbl_sql', 'tbl_lazy', 'tbl')"

Is there a way to set a UDF or to achieve a similar result?

Thank you

like image 443
Marco De Virgilis Avatar asked May 25 '26 15:05

Marco De Virgilis


1 Answers

Under the hood tidyr::complete just performs a full join followed by optional NA fill. You can replicate its effects by using sdf_copy_to to create a new sdf that is just a single column seq.Date between your start and end date, and then perform a full_join between that and your dataset.

like image 200
bcarlsen Avatar answered May 27 '26 06:05

bcarlsen



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!