apply function to grouped rows in dataframe [duplicate]

I have created a function that computes a number of biological statistics, such as species range edges. Here is a simplified version of the function:

range_stats <- function(rangedf, lat, lon, weighting, na.rm=T){
  cent_lat <- weighted.mean(x=rangedf[,lat], w=rangedf[,weighting], na.rm=T)
  cent_lon <- weighted.mean(x=rangedf[,lon], w=rangedf[,weighting], na.rm=T)
  out <- data.frame(cent_lat, cent_lon)
  return(out)
}

I would like to apply this to a large dataframe where every row is an observation of a species. As such, I want the function to group rows by a specified set of columns, and then compute these statistics for each group. Here is a test dataframe:

LATITUDE <- c(27.91977, 21.29066, 26.06340, 28.38918, 25.97517, 27.96313)
LONGITUDE <- c(-175.8617, -157.8645, -173.9593, -178.3571, -173.9679, -175.7837)
BIOMASS <- c(4.3540488, 0.2406332, 0.2406332, 2.1419699, 0.3451426, 1.0946017)
SPECIES <- c('Abudefduf abdominalis','Abudefduf abdominalis','Abudefduf abdominalis','Chaetodon lunulatus','Chaetodon lunulatus','Chaetodon lunulatus')
YEAR <- c('2005', '2005', '2014', '2009', '2009', '2015')
testdf <- data.table(LATITUDE, LONGITUDE, BIOMASS, SPECIES, YEAR)

I want to apply this function to every unique combination of species and year to calculate summary statistics, i.e., the following:

testresult <- testdf %>%
  group_by(SPECIES, YEAR) %>%
  range_stats(lat="LATITUDE",lon="LONGITUDE",weighting="BIOMASS",na.rm=T)

However, the code above fails with the error "(list) object cannot be coerced to type 'double'", and I am not sure how else to approach the problem.

asked Sep 25 '17 22:09

AFH

2 Answers

Since you tagged the question with dplyr and purrr, I assume you are interested in a tidyverse solution, so that is what I demonstrate below.

First, your range_stats function is the source of the error message. weighted.mean() expects a vector for both its x and w arguments. However, if rangedf is a tibble, subsetting it with rangedf[,lat] still returns a one-column tibble rather than a vector. A better way is to use pull() from the dplyr package.

library(tidyverse)
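range_stats <- function(rangedf, lat, lon, weighting, na.rm = TRUE){
  # pull() extracts each column as a plain vector, which weighted.mean() needs;
  # na.rm is passed through instead of being hard-coded
  cent_lat <- weighted.mean(x = rangedf %>% pull(lat), 
                            w = rangedf %>% pull(weighting), na.rm = na.rm)
  cent_lon <- weighted.mean(x = rangedf %>% pull(lon), 
                            w = rangedf %>% pull(weighting), na.rm = na.rm)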
  out <- data.frame(cent_lat, cent_lon)    
  return(out)
}
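
To see the difference for yourself, the small sketch below compares the two ways of subsetting. The toy tibble tb and the variable col are made up for illustration; nothing here comes from your data.

```r
library(tibble)

# Hypothetical two-row tibble, only for illustrating the subsetting behaviour
tb <- tibble(LATITUDE = c(27.9, 21.3), BIOMASS = c(4.35, 0.24))
col <- "LATITUDE"

by_single_bracket <- tb[, col]   # stays a one-column tibble (a list underneath)
by_double_bracket <- tb[[col]]   # a plain numeric vector

is.data.frame(by_single_bracket) # TRUE
is.numeric(by_double_bracket)    # TRUE

# weighted.mean() is happy once both arguments are vectors
wm <- weighted.mean(x = by_double_bracket, w = tb[["BIOMASS"]])
```

Either pull() or [[ ]] gives weighted.mean() the vector it needs; which one you use is mostly a matter of style.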

Next, the way you created the data frame is OK, but data.table() comes from the data.table package, so you end up with a data.table rather than a tibble. Since I am assuming you want a tidyverse approach, I build a tibble instead, as follows.

LATITUDE <- c(27.91977, 21.29066, 26.06340, 28.38918, 25.97517, 27.96313)
LONGITUDE <- c(-175.8617, -157.8645, -173.9593, -178.3571, -173.9679, -175.7837)
BIOMASS <- c(4.3540488, 0.2406332, 0.2406332, 2.1419699, 0.3451426, 1.0946017)
SPECIES <- c('Abudefduf abdominalis','Abudefduf abdominalis','Abudefduf abdominalis','Chaetodon lunulatus','Chaetodon lunulatus','Chaetodon lunulatus')
YEAR <- c('2005', '2005', '2014', '2009', '2009', '2015')
# tibble() replaces the deprecated data_frame(); both build a tibble
testdf <- tibble(LATITUDE, LONGITUDE, BIOMASS, SPECIES, YEAR)

Now, you said you want to apply range_stats to each combination of SPECIES and YEAR. One approach is to split the data frame into a list of data frames and use a function from the lapply family. Here, though, I will show how to do it with the map family of functions from the purrr package, which is part of the tidyverse.
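
For comparison, the split-and-lapply route might look roughly like this in base R. It is only a sketch: it assumes a range_stats that extracts columns with [[ ]] and rebuilds the test data as a plain data.frame, and the names groups and result are made up for illustration.

```r
# Variant of range_stats that uses [[ ]] so it works on any data frame type
range_stats <- function(rangedf, lat, lon, weighting, na.rm = TRUE) {
  cent_lat <- weighted.mean(x = rangedf[[lat]], w = rangedf[[weighting]], na.rm = na.rm)
  cent_lon <- weighted.mean(x = rangedf[[lon]], w = rangedf[[weighting]], na.rm = na.rm)
  data.frame(cent_lat, cent_lon)
}

testdf <- data.frame(
  LATITUDE  = c(27.91977, 21.29066, 26.06340, 28.38918, 25.97517, 27.96313),
  LONGITUDE = c(-175.8617, -157.8645, -173.9593, -178.3571, -173.9679, -175.7837),
  BIOMASS   = c(4.3540488, 0.2406332, 0.2406332, 2.1419699, 0.3451426, 1.0946017),
  SPECIES   = rep(c("Abudefduf abdominalis", "Chaetodon lunulatus"), each = 3),
  YEAR      = c("2005", "2005", "2014", "2009", "2009", "2015"),
  stringsAsFactors = FALSE
)

# One data frame per SPECIES/YEAR combination; drop = TRUE skips empty combinations
groups <- split(testdf, list(testdf$SPECIES, testdf$YEAR), drop = TRUE)

# Apply range_stats to every piece and stack the one-row results
result <- do.call(rbind, lapply(groups, range_stats,
                                lat = "LATITUDE", lon = "LONGITUDE",
                                weighting = "BIOMASS"))
result
```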

We can first create group indices based on SPECIES and YEAR.

testdf2 <- testdf %>%
  mutate(Group = group_indices(., SPECIES, YEAR)) 
testdf2
# A tibble: 6 x 6
  LATITUDE LONGITUDE   BIOMASS               SPECIES  YEAR Group
     <dbl>     <dbl>     <dbl>                 <chr> <chr> <int>
1 27.91977 -175.8617 4.3540488 Abudefduf abdominalis  2005     1
2 21.29066 -157.8645 0.2406332 Abudefduf abdominalis  2005     1
3 26.06340 -173.9593 0.2406332 Abudefduf abdominalis  2014     2
4 28.38918 -178.3571 2.1419699   Chaetodon lunulatus  2009     3
5 25.97517 -173.9679 0.3451426   Chaetodon lunulatus  2009     3
6 27.96313 -175.7837 1.0946017   Chaetodon lunulatus  2015     4

As you can see, Group is a new column showing the index number. Now we can split the data frame based on Group, and then use map_dfr to apply the range_stats function.

testresult <- testdf2 %>%
  split(.$Group) %>%
  map_dfr(range_stats, lat = "LATITUDE",lon = "LONGITUDE", 
          weighting = "BIOMASS", na.rm = TRUE, .id = "Group")
testresult
  Group cent_lat  cent_lon
1     1 27.57259 -174.9191
2     2 26.06340 -173.9593
3     3 28.05418 -177.7480
4     4 27.96313 -175.7837

Notice that map_dfr automatically binds the output list of data frames into a single data frame. .id = "Group" means we want to create a column called Group from the names of the list elements.

I separated the process into two steps, but they can of course be combined into one pipeline as follows.

testresult  <- testdf %>%
  mutate(Group = group_indices(., SPECIES, YEAR))  %>%
  split(.$Group) %>%
  map_dfr(range_stats, lat = "LATITUDE",lon = "LONGITUDE", 
          weighting = "BIOMASS", na.rm = TRUE, .id = "Group")
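
If your dplyr version provides group_modify() (it was added around dplyr 0.8.1, as far as I know), the group_by() pipeline from the question can be kept almost as written. The following is a minimal sketch under that assumption; it uses a [[ ]] version of range_stats, and grouped_result is just an illustrative name.

```r
library(dplyr)

# [[ ]] variant of range_stats; works for tibbles and plain data frames alike
range_stats <- function(rangedf, lat, lon, weighting, na.rm = TRUE) {
  cent_lat <- weighted.mean(x = rangedf[[lat]], w = rangedf[[weighting]], na.rm = na.rm)
  cent_lon <- weighted.mean(x = rangedf[[lon]], w = rangedf[[weighting]], na.rm = na.rm)
  data.frame(cent_lat, cent_lon)
}

testdf <- data.frame(
  LATITUDE  = c(27.91977, 21.29066, 26.06340, 28.38918, 25.97517, 27.96313),
  LONGITUDE = c(-175.8617, -157.8645, -173.9593, -178.3571, -173.9679, -175.7837),
  BIOMASS   = c(4.3540488, 0.2406332, 0.2406332, 2.1419699, 0.3451426, 1.0946017),
  SPECIES   = rep(c("Abudefduf abdominalis", "Chaetodon lunulatus"), each = 3),
  YEAR      = c("2005", "2005", "2014", "2009", "2009", "2015"),
  stringsAsFactors = FALSE
)

grouped_result <- testdf %>%
  group_by(SPECIES, YEAR) %>%
  # .x is the data for one group; the returned data frame is bound to the group keys
  group_modify(~ range_stats(.x, lat = "LATITUDE", lon = "LONGITUDE",
                             weighting = "BIOMASS")) %>%
  ungroup()
grouped_result
```

This keeps SPECIES and YEAR as columns in the result instead of a numeric Group index.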

If you want, testresult can be merged with testdf using left_join, but I will stop here because testresult is probably the output you want. I hope this helps.

answered Sep 28 '22 01:09

www

Fundamentally, the main issue involves weighted.mean(), to which you are passing a dataframe object rather than a vector that can be coerced to double. To fix this inside the function, simply change:

x=rangedf[,lat]

To double brackets:

x=rangedf[[lat]]

Adjusted function:

range_stats <- function(rangedf, lat, lon, weighting, na.rm=TRUE){
  # [[ ]] returns the column as a vector; na.rm is passed through, not hard-coded
  cent_lat <- weighted.mean(x=rangedf[[lat]], w=rangedf[[weighting]], na.rm=na.rm)
  cent_lon <- weighted.mean(x=rangedf[[lon]], w=rangedf[[weighting]], na.rm=na.rm)
  out <- data.frame(cent_lat, cent_lon)    
  return(out)
}

As for the overall group-wise computation, forgive me for bypassing dplyr and data.table, which you use, and consider instead base R's underused but handy function, by().

The challenge with your current setup is that range_stats returns a data.frame of two columns, while group_by() on its own only marks the groups and does not call your function once per group. by(), in contrast, passes dataframe subsets (sliced by the given factors) into a function and returns a list of data.frames, which you can then rbind into one final dataframe:

df_List <- by(testdf, testdf[, c("SPECIES", "YEAR")], FUN=function(df)
                data.frame(species=df$SPECIES[1],
                           year=df$YEAR[1],
                           range_stats(df,"LATITUDE","LONGITUDE","BIOMASS"))
              )

finaldf <- do.call(rbind, df_List)
finaldf
#                 species year cent_lat  cent_lon
# 1 Abudefduf abdominalis 2005 27.57259 -174.9191
# 2   Chaetodon lunulatus 2009 28.05418 -177.7480
# 3 Abudefduf abdominalis 2014 26.06340 -173.9593
# 4   Chaetodon lunulatus 2015 27.96313 -175.7837
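
Since the question builds testdf with data.table(), a grouped call in data.table itself is another option. This is a hedged sketch rather than a tested recommendation: it assumes the [[ ]] version of range_stats, and testdt and dt_result are illustrative names.

```r
library(data.table)

range_stats <- function(rangedf, lat, lon, weighting, na.rm = TRUE) {
  cent_lat <- weighted.mean(x = rangedf[[lat]], w = rangedf[[weighting]], na.rm = na.rm)
  cent_lon <- weighted.mean(x = rangedf[[lon]], w = rangedf[[weighting]], na.rm = na.rm)
  data.frame(cent_lat, cent_lon)
}

testdt <- data.table(
  LATITUDE  = c(27.91977, 21.29066, 26.06340, 28.38918, 25.97517, 27.96313),
  LONGITUDE = c(-175.8617, -157.8645, -173.9593, -178.3571, -173.9679, -175.7837),
  BIOMASS   = c(4.3540488, 0.2406332, 0.2406332, 2.1419699, 0.3451426, 1.0946017),
  SPECIES   = rep(c("Abudefduf abdominalis", "Chaetodon lunulatus"), each = 3),
  YEAR      = c("2005", "2005", "2014", "2009", "2009", "2015")
)

# .SD holds the rows of the current group, without the grouping columns
dt_result <- testdt[, range_stats(.SD, "LATITUDE", "LONGITUDE", "BIOMASS"),
                    by = .(SPECIES, YEAR)]
dt_result
```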

answered Sep 27 '22 23:09

Parfait

Tags: r, dplyr, purrr
