Take the following example data: <pre class="prettyprint"><code>set.seed(1) foo <- data.frame(x=rnorm(10, 0, 10), y=rnorm(10, 0, 10), fac = c(rep("A", 5), rep("B", 5))) </code></pre> I want to split the dataframe "foo" by the variable "fac" into A's and B's, apply a function (mahalanobis distance) that returns a vector of the length of each subgroup, and then mutate the output back on to the original dataframe. For example: <pre class="prettyprint"><code>auto.mahalanobis <- function(x) { temp <- x[, c("x", "y")] return(mahalanobis(temp, center = colMeans(temp, na.rm=T), cov = cov(temp, use="pairwise.complete.obs"))) } foo %>% group_by(fac) %>% mutate(mahal = auto.mahalanobis(.)) </code></pre> Which gives an error. Obviously this procedure can be done manually by splitting the dataset, applying the function, and adding the output as a column before putting it all back together again. But there must be a more efficient way to do this (perhaps this is a misuse of dplyr?).

How about making use of <code>nest</code> instead: <pre class="prettyprint"><code>foo %>% group_by(fac) %>% nest() %>% mutate(mahal = map(data, ~mahalanobis( .x, center = colMeans(.x, na.rm = T), cov = cov(.x, use = "pairwise.complete.obs")))) %>% unnest() ## A tibble: 10 x 4 # fac mahal x y # <fct> <dbl> <dbl> <dbl> # 1 A 1.02 -6.26 15.1 # 2 A 0.120 1.84 3.90 # 3 A 2.81 -8.36 -6.21 # 4 A 2.84 16.0 -22.1 # 5 A 1.21 3.30 11.2 # 6 B 2.15 -8.20 -0.449 # 7 B 2.86 4.87 -0.162 # 8 B 1.23 7.38 9.44 # 9 B 0.675 5.76 8.21 #10 B 1.08 -3.05 5.94 </code></pre> Here you avoid an explicit <code>"x"</code>, <code>"y"</code> filter of the form <code>temp <- x[, c("x", "y")]</code>, as you <code>nest</code> relevant columns after grouping by <code>fac</code>. Applying <code>mahalanobis</code> is then straight-forward. <hr> <h3>Update</h3> To respond to your comment, here is a <code>purrr</code> option. Since it's easy to loose track of what's going on, let's go step-by-step: <ol> <li> Generate sample data with one additional column. <pre class="prettyprint"><code>set.seed(1) foo <- data.frame( x = rnorm(10, 0, 10), y = rnorm(10, 0, 10), z = rnorm(10, 0, 10), fac = c(rep("A", 5), rep("B", 5))) </code></pre> </li> <li> We now store the columns which define the subset of the data to be used for the calculation of the Mahalanobis distance in a <code>list</code> <pre class="prettyprint"><code>cols <- list(cols1 = c("x", "y"), cols2 = c("y", "z")) </code></pre> So we will calculate the Mahalanobis distance (per <code>fac</code>) for the subset of data in columns <code>x</code>+<code>y</code> and then separately for <code>y</code>+<code>z</code>. The names of <code>cols</code> will be used as the column names of the two distance vectors. </li> <li> Now for the actual <code>purrr</code> command: <pre class="prettyprint"><code>imap_dfc(cols, ~nest(foo %>% group_by(fac), .x, .key = !!.y) %>% select(!!.y)) %>% mutate_all(function(lst) map(lst, ~mahalanobis( .x, center = colMeans(.x, na.rm = T), cov = cov(., use = "pairwise.complete.obs")))) %>% unnest() %>% bind_cols(foo, .) # x y z fac cols1 cols2 #1 -6.264538 15.1178117 9.1897737 A 1.0197542 1.3608052 #2 1.836433 3.8984324 7.8213630 A 0.1199607 1.1141352 #3 -8.356286 -6.2124058 0.7456498 A 2.8059562 1.5099574 #4 15.952808 -22.1469989 -19.8935170 A 2.8401953 3.0675228 #5 3.295078 11.2493092 6.1982575 A 1.2141337 0.9475794 #6 -8.204684 -0.4493361 -0.5612874 B 2.1517055 1.2284793 #7 4.874291 -0.1619026 -1.5579551 B 2.8626501 1.1724828 #8 7.383247 9.4383621 -14.7075238 B 1.2271316 2.5723023 #9 5.757814 8.2122120 -4.7815006 B 0.6746788 0.6939081 #10 -3.053884 5.9390132 4.1794156 B 1.0838341 2.3328276 </code></pre> In short, we <ol> <li>loop over entries in <code>cols</code>, </li> <li> <code>nest</code> data in <code>foo</code> per <code>fac</code> based on columns defined in <code>cols</code>, </li> <li>apply <code>mahalanobis</code> on the nested and grouped data generating as many distance columns with nested data as we have entries in <code>cols</code> (i.e. subsets), and </li> <li>finally <code>unnest</code> the distance data and column-bind it to the original <code>foo</code> data. </li> </ol> </li> </ol>

You can simply do - <pre class="prettyprint"><code>foo %>% group_by(fac) %>% mutate(mahal = auto.mahalanobis(data.frame(x, y))) # A tibble: 10 x 4 # Groups: fac [2] x y fac mahal <dbl> <dbl> <fct> <dbl> 1 - 6.26 15.1 A 1.02 2 1.84 3.90 A 0.120 3 - 8.36 - 6.21 A 2.81 4 16.0 -22.1 A 2.84 5 3.30 11.2 A 1.21 6 - 8.20 - 0.449 B 2.15 7 4.87 - 0.162 B 2.86 8 7.38 9.44 B 1.23 9 5.76 8.21 B 0.675 10 - 3.05 5.94 B 1.08 </code></pre> You can remove <code>temp <- x[, c("x", "y")]</code> from your function and simply use <code>temp</code> instead of <code>x</code> as function argument. Cleaned-up function - <pre class="prettyprint"><code>auto.mahalanobis <- function(temp) { mahalanobis(temp, center = colMeans(temp, na.rm=T), cov = cov(temp, use="pairwise.complete.obs") ) } </code></pre> Btw, great job on your first post!

R - use group_by() and mutate() in dplyr to apply function that returns a vector the length of groups

Tags:

Take the following example data:

set.seed(1)

foo <- data.frame(x=rnorm(10, 0, 10), y=rnorm(10, 0, 10), fac = c(rep("A", 5), rep("B", 5)))

I want to split the dataframe "foo" by the variable "fac" into A's and B's, apply a function (mahalanobis distance) that returns a vector of the length of each subgroup, and then mutate the output back on to the original dataframe. For example:

auto.mahalanobis <- function(x) {
  temp <- x[, c("x", "y")]
  return(mahalanobis(temp, center = colMeans(temp, na.rm=T), cov = cov(temp, 
use="pairwise.complete.obs")))
}

foo %>% group_by(fac) %>%
  mutate(mahal = auto.mahalanobis(.))

Which gives an error. Obviously this procedure can be done manually by splitting the dataset, applying the function, and adding the output as a column before putting it all back together again. But there must be a more efficient way to do this (perhaps this is a misuse of dplyr?).

511

asked Dec 06 '18 22:12

TKraft

2 Answers

How about making use of nest instead:

foo %>%
    group_by(fac) %>%
    nest() %>%
    mutate(mahal = map(data, ~mahalanobis(
        .x,
        center = colMeans(.x, na.rm = T),
        cov = cov(.x, use = "pairwise.complete.obs")))) %>%
    unnest()
## A tibble: 10 x 4
#   fac   mahal      x       y
#   <fct> <dbl>  <dbl>   <dbl>
# 1 A     1.02   -6.26  15.1
# 2 A     0.120   1.84   3.90
# 3 A     2.81   -8.36  -6.21
# 4 A     2.84   16.0  -22.1
# 5 A     1.21    3.30  11.2
# 6 B     2.15   -8.20  -0.449
# 7 B     2.86    4.87  -0.162
# 8 B     1.23    7.38   9.44
# 9 B     0.675   5.76   8.21
#10 B     1.08   -3.05   5.94

Here you avoid an explicit "x", "y" filter of the form temp <- x[, c("x", "y")], as you nest relevant columns after grouping by fac. Applying mahalanobis is then straight-forward.

Update

To respond to your comment, here is a purrr option. Since it's easy to loose track of what's going on, let's go step-by-step:

Generate sample data with one additional column.

set.seed(1)
foo <- data.frame(
    x = rnorm(10, 0, 10),
    y = rnorm(10, 0, 10),
    z = rnorm(10, 0, 10),
    fac = c(rep("A", 5), rep("B", 5)))

We now store the columns which define the subset of the data to be used for the calculation of the Mahalanobis distance in a list
```
cols <- list(cols1 = c("x", "y"), cols2 = c("y", "z"))
```
So we will calculate the Mahalanobis distance (per fac) for the subset of data in columns x+y and then separately for y+z. The names of cols will be used as the column names of the two distance vectors.

Now for the actual purrr command:

imap_dfc(cols, ~nest(foo %>% group_by(fac), .x, .key = !!.y) %>% select(!!.y)) %>%
    mutate_all(function(lst) map(lst, ~mahalanobis(
        .x,
        center = colMeans(.x, na.rm = T),
        cov = cov(., use = "pairwise.complete.obs")))) %>%
    unnest() %>%
    bind_cols(foo, .)
#           x           y           z fac     cols1     cols2
#1  -6.264538  15.1178117   9.1897737   A 1.0197542 1.3608052
#2   1.836433   3.8984324   7.8213630   A 0.1199607 1.1141352
#3  -8.356286  -6.2124058   0.7456498   A 2.8059562 1.5099574
#4  15.952808 -22.1469989 -19.8935170   A 2.8401953 3.0675228
#5   3.295078  11.2493092   6.1982575   A 1.2141337 0.9475794
#6  -8.204684  -0.4493361  -0.5612874   B 2.1517055 1.2284793
#7   4.874291  -0.1619026  -1.5579551   B 2.8626501 1.1724828
#8   7.383247   9.4383621 -14.7075238   B 1.2271316 2.5723023
#9   5.757814   8.2122120  -4.7815006   B 0.6746788 0.6939081
#10 -3.053884   5.9390132   4.1794156   B 1.0838341 2.3328276

In short, we

loop over entries in cols,
nest data in foo per fac based on columns defined in cols,
apply mahalanobis on the nested and grouped data generating as many distance columns with nested data as we have entries in cols (i.e. subsets), and
finally unnest the distance data and column-bind it to the original foo data.

105

answered Oct 23 '22 10:10

Maurits Evers

You can simply do -

foo %>% group_by(fac) %>%
  mutate(mahal = auto.mahalanobis(data.frame(x, y)))

# A tibble: 10 x 4
# Groups:   fac [2]
        x       y fac   mahal
    <dbl>   <dbl> <fct> <dbl>
 1 - 6.26  15.1   A     1.02 
 2   1.84   3.90  A     0.120
 3 - 8.36 - 6.21  A     2.81 
 4  16.0  -22.1   A     2.84 
 5   3.30  11.2   A     1.21 
 6 - 8.20 - 0.449 B     2.15 
 7   4.87 - 0.162 B     2.86 
 8   7.38   9.44  B     1.23 
 9   5.76   8.21  B     0.675
10 - 3.05   5.94  B     1.08

You can remove temp <- x[, c("x", "y")] from your function and simply use temp instead of x as function argument.

Cleaned-up function -

auto.mahalanobis <- function(temp) {
  mahalanobis(temp,
              center = colMeans(temp, na.rm=T),
              cov = cov(temp, use="pairwise.complete.obs")
              )
}

Btw, great job on your first post!

answered Oct 23 '22 09:10

Shree

Related questions
                            
                                How can I test Customer Chat Plugin (beta) in localhost?
                            
                                Azure AD GraphServiceClient can't set AdditionalData against User
                            
                                What "command verbs" are available for the os.startfile 'operation' argument and what do they do?
                            
                                Java Webstart "javaws -open" flag doesn't work with multiple arguments
                            
                                Python multiprocessing to execute scripts instead of function
                            
                                'new' keyword in dart
                            
                                Best practice for writing SQL queries in Jupyter
                            
                                Firing Wordpress Gutenberg "Convert to Blocks" programmatically
                            
                                Oracle hierarchical sum (distance from leaf to root)
                            
                                BigQuery Standard SQL: Setting a value to positive or negative infinity
                            
                                C# Reusing timers vs Allocating new one
                            
                                Rails Create Method Creating Duplicates

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With