I have a data frame set up like so:
N1 <- c(1,2,4,3,2,3,4,5,4,3,4,5,4,5,6,8,9)
Start <- c("","Start","","","","","","","Start","","","","Start","","","","")
Stop <- c("","","","","Stop","","","","","","Stop","","","","Stop","","")
With N1 being my data of interest. I would like to calculate the mean of a string of numbers based on the "Start" and "Stop" locations in the next two columns.
The strings as defined by "Start" and "Stop" would look like so:
2,4,3,2
4,3,4
4,5,6
So my final result should be 3 means:
2.75, 3.67, 5
You can try pairing the "Start" and "Stop" positions with mapply:
mapply(function(start, stop){
mean(N1[start:stop])
},
start=which(Start!=""),
stop=which(Stop!=""))
#[1] 2.750000 3.666667 5.000000
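The mapply call above assumes every "Start" has a matching "Stop" after it. As a minimal sketch under that assumption, the same idea can be wrapped in a small function that checks the pairing first. The name segment_means is hypothetical, not something from base R or any package:

```r
N1 <- c(1,2,4,3,2,3,4,5,4,3,4,5,4,5,6,8,9)
Start <- c("","Start","","","","","","","Start","","","","Start","","","","")
Stop <- c("","","","","Stop","","","","","","Stop","","","","Stop","","")

starts <- which(Start != "")  # row positions of the "Start" markers
stops <- which(Stop != "")    # row positions of the "Stop" markers

# hypothetical helper: mean of x over each starts[i]:stops[i] range
segment_means <- function(x, starts, stops) {
  # fail early if the markers do not pair up one-to-one and in order
  stopifnot(length(starts) == length(stops), all(starts <= stops))
  mapply(function(s, e) mean(x[s:e]), starts, stops)
}

segment_means(N1, starts, stops)
#[1] 2.750000 3.666667 5.000000
```

If the markers are unbalanced, stopifnot raises an error instead of mapply silently recycling the shorter vector.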
library(data.table) # rleid() needs data.table 1.9.5+
# build an indicator column: 1 inside a Start..Stop period, 0 otherwise
d = data.table(N1, event = cumsum((Start != "") - c(0, head(Stop != "", -1))))
d[, mean(N1), by = .(event, rleid(event))][event == 1, V1]
#[1] 2.750000 3.666667 5.000000
# or equivalently
d[, .(event[1], mean(N1)), by = rleid(event)][V1 == 1, V2]
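To see what the event column is doing, here is a rough base R sketch of the same indicator trick. The run vector is only a stand-in for what rleid produces (a new id each time event changes value); it is an illustration, not the data.table implementation:

```r
N1 <- c(1,2,4,3,2,3,4,5,4,3,4,5,4,5,6,8,9)
Start <- c("","Start","","","","","","","Start","","","","Start","","","","")
Stop <- c("","","","","Stop","","","","","","Stop","","","","Stop","","")

# +1 at each "Start", -1 on the row after each "Stop";
# the running sum is 1 inside a period and 0 outside
event <- cumsum((Start != "") - c(0, head(Stop != "", -1)))
event
# [1] 0 1 1 1 1 0 0 0 1 1 1 0 1 1 1 0 0

# base R stand-in for rleid(): the id increases whenever event changes
run <- cumsum(c(1, diff(event) != 0))

# average N1 within each run where event == 1
res <- as.numeric(tapply(N1[event == 1], run[event == 1], mean))
res
#[1] 2.750000 3.666667 5.000000
```

The run id matters because the second and third periods are separated by only one row, and grouping by event alone would merge all three periods into one.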
You can also try rollapply from the zoo package:
library(zoo)
x <- sort(c(which(Stop != ""), which(Start != ""))) # indices of Start and Stop
rollapply(x, 2, FUN = function(y) mean(N1[y[1]:y[2]]), by=2)
#[1] 2.750000 3.666667 5.000000