I have created a function that computes a number of biological statistics, such as species range edges. Here is a simplified version of the function:
range_stats <- function(rangedf, lat, lon, weighting, na.rm=T){
cent_lat <- weighted.mean(x=rangedf[,lat], w=rangedf[,weighting], na.rm=T)
cent_lon <- weighted.mean(x=rangedf[,lon], w=rangedf[,weighting], na.rm=T)
out <- data.frame(cent_lat, cent_lon)
return(out)
}
I would like to apply this to a large dataframe where every row is an observation of a species. As such, I want the function to group rows by a specified set of columns, and then compute these statistics for each group. Here is a test dataframe:
LATITUDE <- c(27.91977, 21.29066, 26.06340, 28.38918, 25.97517, 27.96313)
LONGITUDE <- c(-175.8617, -157.8645, -173.9593, -178.3571, -173.9679, -175.7837)
BIOMASS <- c(4.3540488, 0.2406332, 0.2406332, 2.1419699, 0.3451426, 1.0946017)
SPECIES <- c('Abudefduf abdominalis','Abudefduf abdominalis','Abudefduf abdominalis','Chaetodon lunulatus','Chaetodon lunulatus','Chaetodon lunulatus')
YEAR <- c('2005', '2005', '2014', '2009', '2009', '2015')
testdf <- data.table(LATITUDE, LONGITUDE, BIOMASS, SPECIES, YEAR)
I want to apply this function to every unique combination of species and year to calculate summary statistics, i.e., the following:
testresult <- testdf %>%
group_by(SPECIES, YEAR) %>%
range_stats(lat="LATITUDE",lon="LONGITUDE",weighting="BIOMASS",na.rm=T)
However, the code above does not work (I get a "(list) object cannot be coerced to type 'double'" error) and I am not sure how else to approach the problem.
Since you added the dplyr and purrr tags, I assume you are interested in a tidyverse solution, so below I will demonstrate a solution based on the tidyverse.
First, your range_stats is problematic. This is why you got the error message. weighted.mean expects a vector for both the x and w arguments. However, if rangedf is a tibble, subsetting it with rangedf[,lat] still returns a one-column tibble rather than a vector. A better way is to use pull from the dplyr package.
library(tidyverse)
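range_stats <- function(rangedf, lat, lon, weighting, na.rm = TRUE){
  # pull() extracts each column as a plain vector, which weighted.mean() needs;
  # na.rm is passed through instead of being hard-coded to TRUE
  cent_lat <- weighted.mean(x = rangedf %>% pull(lat),
                            w = rangedf %>% pull(weighting), na.rm = na.rm)
  cent_lon <- weighted.mean(x = rangedf %>% pull(lon),
                            w = rangedf %>% pull(weighting), na.rm = na.rm)
  out <- data.frame(cent_lat, cent_lon)
  return(out)
}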
Next, the way you created the data frame is OK, but data.table is from the data.table package, so you will create a data.table, not a tibble. I assumed you want a tidyverse approach, so I changed data.table to data_frame as follows. (In current versions of tibble, data_frame is deprecated and tibble is the direct replacement.)
LATITUDE <- c(27.91977, 21.29066, 26.06340, 28.38918, 25.97517, 27.96313)
LONGITUDE <- c(-175.8617, -157.8645, -173.9593, -178.3571, -173.9679, -175.7837)
BIOMASS <- c(4.3540488, 0.2406332, 0.2406332, 2.1419699, 0.3451426, 1.0946017)
SPECIES <- c('Abudefduf abdominalis','Abudefduf abdominalis','Abudefduf abdominalis','Chaetodon lunulatus','Chaetodon lunulatus','Chaetodon lunulatus')
YEAR <- c('2005', '2005', '2014', '2009', '2009', '2015')
testdf <- data_frame(LATITUDE, LONGITUDE, BIOMASS, SPECIES, YEAR)
Now, you said you want to apply the range_stats function to each combination of SPECIES and YEAR. One approach is to split the data frame into a list of data frames and use a function from the lapply family. But here I want to show you how to use the map family of functions to achieve this task, as map is from the purrr package, which is part of the tidyverse.
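For comparison, the split-and-lapply route mentioned above can be sketched in base R alone. This is only an illustration, not part of the original answer: it redefines range_stats with double-bracket column extraction so that it works on a plain data.frame, and the names groups and stats_list are made up for the example.

```r
# Base R sketch (no packages): split by SPECIES and YEAR, then lapply.
range_stats <- function(rangedf, lat, lon, weighting, na.rm = TRUE) {
  w <- rangedf[[weighting]]  # [[ ]] returns a plain numeric vector
  data.frame(
    cent_lat = weighted.mean(rangedf[[lat]], w, na.rm = na.rm),
    cent_lon = weighted.mean(rangedf[[lon]], w, na.rm = na.rm)
  )
}

testdf <- data.frame(
  LATITUDE  = c(27.91977, 21.29066, 26.06340, 28.38918, 25.97517, 27.96313),
  LONGITUDE = c(-175.8617, -157.8645, -173.9593, -178.3571, -173.9679, -175.7837),
  BIOMASS   = c(4.3540488, 0.2406332, 0.2406332, 2.1419699, 0.3451426, 1.0946017),
  SPECIES   = c("Abudefduf abdominalis", "Abudefduf abdominalis", "Abudefduf abdominalis",
                "Chaetodon lunulatus", "Chaetodon lunulatus", "Chaetodon lunulatus"),
  YEAR      = c("2005", "2005", "2014", "2009", "2009", "2015")
)

# drop = TRUE discards SPECIES/YEAR combinations that have no rows
groups <- split(testdf, list(testdf$SPECIES, testdf$YEAR), drop = TRUE)

# Apply range_stats to each piece, then stack the one-row results
stats_list <- lapply(groups, range_stats,
                     lat = "LATITUDE", lon = "LONGITUDE", weighting = "BIOMASS")
base_result <- do.call(rbind, stats_list)
base_result
```

The list names produced by split carry the species and year of each group, so the stacked result stays identifiable without an extra index column.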
We can first create group indices based on SPECIES and YEAR.
testdf2 <- testdf %>%
mutate(Group = group_indices(., SPECIES, YEAR))
testdf2
# A tibble: 6 x 6
LATITUDE LONGITUDE BIOMASS SPECIES YEAR Group
<dbl> <dbl> <dbl> <chr> <chr> <int>
1 27.91977 -175.8617 4.3540488 Abudefduf abdominalis 2005 1
2 21.29066 -157.8645 0.2406332 Abudefduf abdominalis 2005 1
3 26.06340 -173.9593 0.2406332 Abudefduf abdominalis 2014 2
4 28.38918 -178.3571 2.1419699 Chaetodon lunulatus 2009 3
5 25.97517 -173.9679 0.3451426 Chaetodon lunulatus 2009 3
6 27.96313 -175.7837 1.0946017 Chaetodon lunulatus 2015 4
As you can see, Group is a new column showing the index number. Now we can split the data frame based on Group, and then use map_dfr to apply the range_stats function.
testresult <- testdf2 %>%
split(.$Group) %>%
map_dfr(range_stats, lat = "LATITUDE",lon = "LONGITUDE",
weighting = "BIOMASS", na.rm = TRUE, .id = "Group")
testresult
Group cent_lat cent_lon
1 1 27.57259 -174.9191
2 2 26.06340 -173.9593
3 3 28.05418 -177.7480
4 4 27.96313 -175.7837
Notice that map_dfr automatically binds the output list of data frames into a single data frame. .id = "Group" means we want to create a column called Group based on the names of the list elements.
I separated the process into two steps, but of course they can be all in one pipeline as follows.
testresult <- testdf %>%
mutate(Group = group_indices(., SPECIES, YEAR)) %>%
split(.$Group) %>%
map_dfr(range_stats, lat = "LATITUDE",lon = "LONGITUDE",
weighting = "BIOMASS", na.rm = TRUE, .id = "Group")
If you want, testresult can be merged with testdf using left_join, but I will stop here as testresult is probably already the output you want. I hope this helps.
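As an alternative that stays closer to the group_by pipeline in the question, a reasonably recent dplyr provides group_modify, which calls a function once per group and binds the results. The following is a sketch under that assumption rather than part of the original answer; range_stats is redefined here with double-bracket extraction so the block is self-contained.

```r
library(dplyr)

# Same statistics as in the question; [[ ]] yields plain vectors
range_stats <- function(rangedf, lat, lon, weighting, na.rm = TRUE) {
  w <- rangedf[[weighting]]
  data.frame(
    cent_lat = weighted.mean(rangedf[[lat]], w, na.rm = na.rm),
    cent_lon = weighted.mean(rangedf[[lon]], w, na.rm = na.rm)
  )
}

testdf <- tibble(
  LATITUDE  = c(27.91977, 21.29066, 26.06340, 28.38918, 25.97517, 27.96313),
  LONGITUDE = c(-175.8617, -157.8645, -173.9593, -178.3571, -173.9679, -175.7837),
  BIOMASS   = c(4.3540488, 0.2406332, 0.2406332, 2.1419699, 0.3451426, 1.0946017),
  SPECIES   = c("Abudefduf abdominalis", "Abudefduf abdominalis", "Abudefduf abdominalis",
                "Chaetodon lunulatus", "Chaetodon lunulatus", "Chaetodon lunulatus"),
  YEAR      = c("2005", "2005", "2014", "2009", "2009", "2015")
)

# group_modify() passes each group's rows as .x and expects a data frame back;
# the grouping columns are added to the result automatically
grouped_result <- testdf %>%
  group_by(SPECIES, YEAR) %>%
  group_modify(~ range_stats(.x, lat = "LATITUDE", lon = "LONGITUDE",
                             weighting = "BIOMASS")) %>%
  ungroup()
grouped_result
```

This keeps SPECIES and YEAR in the output directly, so no separate Group index or join is needed.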
Fundamentally, the main issue involves weighted.mean(), where you are passing a data frame object and not a vector that can be coerced to double. To fix this within the function, simply change:
x=rangedf[,lat]
To double brackets:
x=rangedf[[lat]]
Adjusted method:
range_stats <- function(rangedf, lat, lon, weighting, na.rm=T){
cent_lat <- weighted.mean(x=rangedf[[lat]], w=rangedf[[weighting]], na.rm=T)
cent_lon <- weighted.mean(x=rangedf[[lon]], w=rangedf[[weighting]], na.rm=T)
out <- data.frame(cent_lat, cent_lon)
return(out)
}
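The difference between the two subsetting forms can be checked directly. The sketch below uses a plain data.frame with drop = FALSE to imitate what single-bracket subsetting returns on a tibble (tibbles never drop to a vector with single brackets); the names df, one_col, vec and cent are illustrative only.

```r
df <- data.frame(
  LATITUDE = c(27.91977, 21.29066),
  BIOMASS  = c(4.3540488, 0.2406332)
)

# Single brackets with drop = FALSE keep the data frame wrapper,
# which is what a tibble does for rangedf[, lat]
one_col <- df[, "LATITUDE", drop = FALSE]
is.data.frame(one_col)  # TRUE: still a list-like object, not a vector
is.numeric(one_col)     # FALSE

# Double brackets extract the column itself
vec <- df[["LATITUDE"]]
is.numeric(vec)         # TRUE

# weighted.mean() works as intended on plain vectors
cent <- weighted.mean(x = vec, w = df[["BIOMASS"]])
cent
```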
As for the overall group-wise computation, forgive me for bypassing dplyr and data.table, which you use, and consider base R's underutilized but useful function, by().
The challenge with your current setup is that range_stats returns a data.frame of two columns, and piping the result of dplyr's group_by() into an ordinary function does not call that function once per group; the function simply receives the whole grouped data frame. However, by passes data frame objects (sliced by factors) into a defined function and returns a list of data.frames, which you can then rbind into one final data frame:
df_List <- by(testdf, testdf[, c("SPECIES", "YEAR")], FUN=function(df)
data.frame(species=df$SPECIES[1],
year=df$YEAR[1],
range_stats(df,"LATITUDE","LONGITUDE","BIOMASS"))
)
finaldf <- do.call(rbind, df_List)
finaldf
# species year cent_lat cent_lon
# 1 Abudefduf abdominalis 2005 27.57259 -174.9191
# 2 Chaetodon lunulatus 2009 28.05418 -177.7480
# 3 Abudefduf abdominalis 2014 26.06340 -173.9593
# 4 Chaetodon lunulatus 2015 27.96313 -175.7837