apply function to grouped rows in dataframe [duplicate]

Tags: r, dplyr, purrr

I have created a function that computes a number of biological statistics, such as species range edges. Here is a simplified version of the function:

range_stats <- function(rangedf, lat, lon, weighting, na.rm=T){
  cent_lat <- weighted.mean(x=rangedf[,lat], w=rangedf[,weighting], na.rm=T)
  cent_lon <- weighted.mean(x=rangedf[,lon], w=rangedf[,weighting], na.rm=T)
  out <- data.frame(cent_lat, cent_lon)
  return(out)
} 

I would like to apply this to a large dataframe where every row is an observation of a species. As such, I want the function to group rows by a specified set of columns, and then compute these statistics for each group. Here is a test dataframe:

LATITUDE <- c(27.91977, 21.29066, 26.06340, 28.38918, 25.97517, 27.96313)
LONGITUDE <- c(-175.8617, -157.8645, -173.9593, -178.3571, -173.9679, -175.7837)
BIOMASS <- c(4.3540488, 0.2406332, 0.2406332, 2.1419699, 0.3451426, 1.0946017)
SPECIES <- c('Abudefduf abdominalis','Abudefduf abdominalis','Abudefduf abdominalis','Chaetodon lunulatus','Chaetodon lunulatus','Chaetodon lunulatus')
YEAR <- c('2005', '2005', '2014', '2009', '2009', '2015')
testdf <- data.table(LATITUDE, LONGITUDE, BIOMASS, SPECIES, YEAR)

I want to apply this function to every unique combination of species and year to calculate summary statistics, i.e., the following:

testresult <- testdf %>%
  group_by(SPECIES, YEAR) %>%
  range_stats(lat="LATITUDE",lon="LONGITUDE",weighting="BIOMASS",na.rm=T)

However, the code above does not work: I get the error "(list) object cannot be coerced to type 'double'", and I am not sure how else to approach the problem.

Asked by AFH, Sep 25 '17 22:09



2 Answers

Since you tagged the question with dplyr and purrr, I assume you are interested in a tidyverse solution, so that is what I will demonstrate below.

First, your range_stats function is the source of the error message. weighted.mean expects a vector for both its x and w arguments. However, if rangedf is a tibble, subsetting it as rangedf[,lat] still returns a one-column tibble rather than a vector. A better way is to use pull from the dplyr package.

library(tidyverse)
range_stats <- function(rangedf, lat, lon, weighting, na.rm = TRUE){
  # pull() extracts each column as a plain vector, which weighted.mean() needs;
  # na.rm is passed through instead of being hard-coded
  cent_lat <- weighted.mean(x = rangedf %>% pull(lat), 
                            w = rangedf %>% pull(weighting), na.rm = na.rm)
  cent_lon <- weighted.mean(x = rangedf %>% pull(lon), 
                            w = rangedf %>% pull(weighting), na.rm = na.rm)
  out <- data.frame(cent_lat, cent_lon)    
  return(out)
} 
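To make the subsetting difference concrete, here is a minimal sketch on a made-up two-row tibble (the object names tb, one_col and vec are only for illustration). It shows that single-bracket subsetting keeps the tibble class, while double brackets return the plain vector that weighted.mean can work with.

```r
library(tibble)

# Hypothetical two-row tibble using the first two rows of the test data
tb <- tibble(LATITUDE = c(27.91977, 21.29066),
             BIOMASS  = c(4.3540488, 0.2406332))
lat <- "LATITUDE"

one_col <- tb[, lat]   # single bracket: still a one-column tibble
vec     <- tb[[lat]]   # double bracket: a plain numeric vector

is.data.frame(one_col) # TRUE
is.numeric(vec)        # TRUE

# With plain vectors, weighted.mean() behaves as intended
wm <- weighted.mean(vec, w = tb[["BIOMASS"]])
```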

Next, the way you created the data frame is OK, but data.table comes from the data.table package, so you create a data.table, not a tibble. Since I assume you want a tidyverse approach, I changed data.table to tibble as follows (data_frame, its older alias, is deprecated in current versions of the tibble package).

LATITUDE <- c(27.91977, 21.29066, 26.06340, 28.38918, 25.97517, 27.96313)
LONGITUDE <- c(-175.8617, -157.8645, -173.9593, -178.3571, -173.9679, -175.7837)
BIOMASS <- c(4.3540488, 0.2406332, 0.2406332, 2.1419699, 0.3451426, 1.0946017)
SPECIES <- c('Abudefduf abdominalis','Abudefduf abdominalis','Abudefduf abdominalis','Chaetodon lunulatus','Chaetodon lunulatus','Chaetodon lunulatus')
YEAR <- c('2005', '2005', '2014', '2009', '2009', '2015')
testdf <- tibble(LATITUDE, LONGITUDE, BIOMASS, SPECIES, YEAR)

Now, you said you want to apply the range_stats function to each combination of SPECIES and YEAR. One approach is to split the data frame into a list of data frames and use a function from the lapply family. Here, however, I want to show how to use the map family of functions instead, since map comes from the purrr package, which is part of the tidyverse.

We can first create group indices based on SPECIES and YEAR.

testdf2 <- testdf %>%
  mutate(Group = group_indices(., SPECIES, YEAR)) 
testdf2
# A tibble: 6 x 6
  LATITUDE LONGITUDE   BIOMASS               SPECIES  YEAR Group
     <dbl>     <dbl>     <dbl>                 <chr> <chr> <int>
1 27.91977 -175.8617 4.3540488 Abudefduf abdominalis  2005     1
2 21.29066 -157.8645 0.2406332 Abudefduf abdominalis  2005     1
3 26.06340 -173.9593 0.2406332 Abudefduf abdominalis  2014     2
4 28.38918 -178.3571 2.1419699   Chaetodon lunulatus  2009     3
5 25.97517 -173.9679 0.3451426   Chaetodon lunulatus  2009     3
6 27.96313 -175.7837 1.0946017   Chaetodon lunulatus  2015     4

As you can see, Group is a new column showing the index number. Now we can split the data frame based on Group, and then use map_dfr to apply the range_stats function.

testresult <- testdf2 %>%
  split(.$Group) %>%
  map_dfr(range_stats, lat = "LATITUDE",lon = "LONGITUDE", 
          weighting = "BIOMASS", na.rm = TRUE, .id = "Group")
testresult
  Group cent_lat  cent_lon
1     1 27.57259 -174.9191
2     2 26.06340 -173.9593
3     3 28.05418 -177.7480
4     4 27.96313 -175.7837

Notice that map_dfr automatically binds the output list of data frames into a single data frame. .id = "Group" means we want to create a column called Group from the names of the list elements.

I separated the process into two steps, but of course they can be all in one pipeline as follows.

testresult  <- testdf %>%
  mutate(Group = group_indices(., SPECIES, YEAR))  %>%
  split(.$Group) %>%
  map_dfr(range_stats, lat = "LATITUDE",lon = "LONGITUDE", 
          weighting = "BIOMASS", na.rm = TRUE, .id = "Group")
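As a possible alternative to the split-and-map pipeline, newer versions of dplyr (0.8.0 and later, as far as I know) provide group_modify, which applies a function returning a data frame to each group and keeps the grouping columns in the result. The sketch below is not part of the original answer; it assumes a range_stats variant that uses double-bracket extraction, and the name grouped_result is only for illustration.

```r
library(dplyr)
library(tibble)

# Assumed variant of range_stats using [[ ]] so each column is a plain vector
range_stats <- function(rangedf, lat, lon, weighting, na.rm = TRUE) {
  w <- rangedf[[weighting]]
  data.frame(
    cent_lat = weighted.mean(rangedf[[lat]], w = w, na.rm = na.rm),
    cent_lon = weighted.mean(rangedf[[lon]], w = w, na.rm = na.rm)
  )
}

testdf <- tibble(
  LATITUDE  = c(27.91977, 21.29066, 26.06340, 28.38918, 25.97517, 27.96313),
  LONGITUDE = c(-175.8617, -157.8645, -173.9593, -178.3571, -173.9679, -175.7837),
  BIOMASS   = c(4.3540488, 0.2406332, 0.2406332, 2.1419699, 0.3451426, 1.0946017),
  SPECIES   = c("Abudefduf abdominalis", "Abudefduf abdominalis", "Abudefduf abdominalis",
                "Chaetodon lunulatus", "Chaetodon lunulatus", "Chaetodon lunulatus"),
  YEAR      = c("2005", "2005", "2014", "2009", "2009", "2015")
)

# group_modify() calls the function once per group; .x is that group's rows
grouped_result <- testdf %>%
  group_by(SPECIES, YEAR) %>%
  group_modify(~ range_stats(.x, lat = "LATITUDE", lon = "LONGITUDE",
                             weighting = "BIOMASS")) %>%
  ungroup()
```

The result has one row per SPECIES and YEAR combination, with the grouping columns alongside cent_lat and cent_lon, so no separate Group index is needed.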

If you want, testresult can be merged with testdf using left_join, but I will stop here as testresult is probably already the output you want. I hope this helps.
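For readers who do want that merge, here is a rough sketch. It hard-codes the Group indices and centroid values from the printed output above rather than recomputing them, and the name joined is only for illustration. One detail to watch: the Group column created through .id holds the list names, so it is character, while the Group column from group_indices is integer; the sketch converts it before joining.

```r
library(dplyr)

# Group indices copied from the printed testdf2 above
testdf2 <- data.frame(
  SPECIES = c("Abudefduf abdominalis", "Abudefduf abdominalis", "Abudefduf abdominalis",
              "Chaetodon lunulatus", "Chaetodon lunulatus", "Chaetodon lunulatus"),
  YEAR    = c("2005", "2005", "2014", "2009", "2009", "2015"),
  Group   = c(1L, 1L, 2L, 3L, 3L, 4L)
)

# Centroids copied from the printed testresult above; Group is character here
testresult <- data.frame(
  Group    = c("1", "2", "3", "4"),
  cent_lat = c(27.57259, 26.06340, 28.05418, 27.96313),
  cent_lon = c(-174.9191, -173.9593, -177.7480, -175.7837)
)

# Align the key types, then attach each group's centroid to its observations
joined <- testdf2 %>%
  left_join(testresult %>% mutate(Group = as.integer(Group)), by = "Group")
```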

Answered by www, Sep 28 '22 01:09


Fundamentally, the main issue involves weighted.mean(), to which you are passing a dataframe object rather than a vector that can be coerced to double. To fix this within the function, simply change:

x=rangedf[,lat]

To double brackets:

x=rangedf[[lat]]

Adjusted method:

range_stats <- function(rangedf, lat, lon, weighting, na.rm = TRUE){
  # [[ ]] extracts each column as a vector; na.rm is passed through, not hard-coded
  cent_lat <- weighted.mean(x=rangedf[[lat]], w=rangedf[[weighting]], na.rm=na.rm)
  cent_lon <- weighted.mean(x=rangedf[[lon]], w=rangedf[[weighting]], na.rm=na.rm)
  out <- data.frame(cent_lat, cent_lon)    
  return(out)
} 

As for the overall group-by computation, forgive me for bypassing dplyr and data.table, which you use, and consider instead base R's underutilized but useful function, by().

The challenge with your current setup is that range_stats returns a data.frame of two columns, while dplyr's group_by() only marks the groups: piping the grouped data into range_stats passes the whole data frame at once instead of one group at a time. by(), however, passes dataframe objects (sliced by the given factors) into a defined function and returns a list of data.frames, which you can then rbind into one final dataframe:

df_List <- by(testdf, testdf[, c("SPECIES", "YEAR")], FUN=function(df)
                data.frame(species=df$SPECIES[1],
                           year=df$YEAR[1],
                           range_stats(df,"LATITUDE","LONGITUDE","BIOMASS"))
              )

finaldf <- do.call(rbind, df_List)
finaldf
#                 species year cent_lat  cent_lon
# 1 Abudefduf abdominalis 2005 27.57259 -174.9191
# 2   Chaetodon lunulatus 2009 28.05418 -177.7480
# 3 Abudefduf abdominalis 2014 26.06340 -173.9593
# 4   Chaetodon lunulatus 2015 27.96313 -175.7837

Answered by Parfait, Sep 27 '22 23:09