Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

dplyr summarise with dynamic columns

Tags:

dynamic

r

dplyr

I'm trying to use dplyr against my postgres database and am conducting a simple function. Everything works if I parse the column name directly, however I want to do this dynamically (i.e. sort through each column name from another dataframe

The problem I'm geeting is for the first two calculations, i'm getting the right results

Assume the first dynamic column is called "id"

pull_table %>%
    summarise(
        row_count = n(), 
        distinct_count = n_distinct(var) , 
        distinct_count_minus_blank = n_distinct(ifelse(var=="",NA,var)), 
        maxvalue = max(var), 
        minvalue = min(var), 
        maxlength = max(length(var)), 
        minlen = min(length(var))
    )  %>% 
    show_query()

The wrong result I get is obvious when you see the sql - sometimes id has '' around it so it's calculating as a string:

<SQL>
SELECT 
    COUNT(*) AS "row_count", 
    COUNT(DISTINCT id) AS "distinct_count", 
    COUNT(
        DISTINCT CASE 
            WHEN ('id' = '') THEN (NULL) 
            WHEN NOT('id' = '') THEN ('id') 
        END) AS "distinct_count_minus_blank", 
    MAX('id') AS "maxvalue", 
    MIN('id') AS "minvalue", 
    MAX(LENGTH('id')) AS "maxlength", 
    MIN(LENGTH('id')) AS "minlen"
FROM "table"

You can see from this output that sometimes the calculation is happening on the column, but sometimes it's just happening on the string "id". Why is this and how can I fix it so it calculates on the actual column rather than the string?

like image 725
shecode Avatar asked Dec 17 '25 16:12

shecode


1 Answers

I think you should look at rlang::sym (which is imported by dplyr).

Assuming pull_table is a dataframe including id, some_numeric_variable and some_character_variable columns, you could write something like this:

xx = sym("id")
yy = sym("some_numeric_variable")
ww = sym("some_character_variable")
pull_table %>%
    summarise(
        row_count = n(), 
        distinct_count = n_distinct(!!xx) , 
        distinct_count_minus_blank = n_distinct(ifelse(var=="", NA, !!xx)), 
        maxvalue = max(!!yy ), 
        minvalue = min(!!yy ), 
        maxlength = max(length(!!ww)), 
        minlen = min(length(!!ww))
    )

The sym() function turn a string variable into a name, which can be unquoted inside dplyr functions with the !! operator. If you want more information, please take a look at the quasiquotation doc or this tuto.

Unfortunately, since I didn't have any tbl_sql at hand, I couldn't test it with show_query.

Side advice: don't ever name your variables "var" as var is also the variance function. I pulled my hair off many times just because this had messed up with some packages or custom functions.

like image 181
Dan Chaltiel Avatar answered Dec 19 '25 06:12

Dan Chaltiel



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!