I'm trying to run a function from the modifiedmk
package in R.
install.packages('modifiedmk')
library(modifiedmk)
I have a dataframe data
which I produced with the following:
Station <- c('APT','APT', 'APT','APT', 'APT', 'APT', 'APT','APT', 'APT','APT','APT','APT',
'AF','AF', 'AF','AF','AF','AF','AF','AF','AF',
'EL', 'EL', 'EL', 'EL', 'EL', 'EL', 'EL', 'EL', 'EL', 'EL', 'EL', 'EL', 'EL', 'EL', 'EL',
'GFS', 'GFS', 'GFS', 'GFS', 'GFS', 'GFS', 'GFS', 'GFS', 'GFS', 'GFS', 'GFS', 'GFS', 'GFS', 'GFS', 'GFS', 'GFS'
)
Rainfall <- c(375.3, 263.3, 399.2, 242.6, 847.6, 276.5, 712.8, 366.3, 188.6, 478.4, 539, 682.5,
520.7, 1337.8, 524, 908.4,748.5,411.8, 772.4,978.5,983,
732.4, 788.6, 567.1, 576, 931.6, 727.2, 1079.3, 902.8,493.4, 630.7, 784.1,660.2, 531.3, 487.1,798.4,
1064.1, 590.3, 1011.2, 1037.1, 1398.4, 1153.6,994.1, 1100.2,743.7,637.4, 792.2, 891.9,880.9, 670, 920.2,681.4)
Year <- c('1957','1958','1959','1960','1961','1962','1963','1964','1965','1966','1967','1968',
'1960','1961','1962','1963','1964','1965','1966','1967','1968',
'1957','1958','1959','1960','1961','1962','1963','1964','1965','1966','1967','1968','1969','1970','1971',
'1964','1965','1966','1967','1968','1969','1970','1971','1972','1973','1974','1975','1976','1977','1978','1979')
length(Year)
data<-data.frame(Year, Station, Rainfall)
where I have four Stations
of rainfall data as rows in the dataframe. I want to apply the mmky1lag
method from the modifiedmk
package on each Station
of data and produce a summary table in R that has two columns:
For example, I can run the mmky1lag
method on all of the Rainfall data using mmky1lag(as.vector(data$Rainfall))
which produces
> mmky1lag(as.vector(data$Rainfall))
Corrected Zc new P-value N/N* Original Z old P.value
3.332353e+00 8.611480e-04 1.297360e+00 3.795608e+00 1.472822e-04
Tau Sen's slope old.variance new.variance
3.634992e-01 9.092857e+00 1.605933e+04 2.083474e+04
And I'm interested in two of those outputs:
Column 1:
# Get percent of stations with significant trends where p < 0.05
mmky1lag(as.vector(data$Rainfall))[2] < 0.05
and Column 2:
# Make another column that is the mean Sen's slope
mmky1lag(as.vector(data$Rainfall))[7]
However, how do I apply this method on data
where I get a result for each individual Station
? In python, I would groupby Station
and then apply the method. But I'm not sure how to do that in R.
Then after grouping by station, I want a summary table with the two aforementioned columns of information.
The easiest way to create summary tables in R is to use the describe() and describeBy() functions from the psych library.
The summary table is a visualization that summarizes statistical information about data in table form.
Descriptive statistics in R (Method 1): summary statistic is computed using summary() function in R. summary() function is automatically applied to each column. The format of the result depends on the data type of the column. If the column is a numeric variable, mean, median, min, max and quartiles are returned.
If you want to apply the mmky1lag
function to a data frame by group (in this case, station
) there are multiple approaches to consider.
First, you could use aggregate
:
library(modifiedmk)
mktests <- aggregate(Rainfall ~ Station, data = data, FUN = mmky1lag)
This will take a formula using the measure of Rainfall
by Station
group. All of your results will be returned in a matrix with the MK test parameters in a single column.
Another approach might be with data.table
package.
library(data.table)
mktests <- as.data.table(data)[, as.list(mmky1lag(Rainfall)), by = Station]
This will take the results from mmky1lag
and put into a list and then converted to a data table. The option by
will allow you to perform this by Station
.
A third approach might be with dplyr
package.
library(dplyr)
mktests <- data %>%
group_by(Station) %>%
group_map(~mmky1lag(.x$Rainfall)) %>%
setNames(unique(sort(data$Station))) %>%
bind_rows(.id = "Station")
This uses group_by
to group by Station
, and then group_map
which will apply the mmky1lag
function to grouped elements. The setNames
is needed to add Station
value back to the results, and then bind_rows
to turn the resultant list into a data frame.
The result (with the data.table
solution) should look like this (the other approaches should be similar):
R> mktests
Station Corrected Zc new P-value N/N* Original Z old P.value Tau Sen's slope old.variance new.variance
1: APT 1.2801214 0.2005025 0.4849366 0.8914431 0.3726915 0.2121212 17.32083 212.6667 103.12986
2: AF 1.2424858 0.2140574 0.5703144 0.9383149 0.3480826 0.2777778 29.73750 92.0000 52.46892
3: EL -0.7452428 0.4561249 1.1288325 -0.7917947 0.4284804 -0.1619048 -9.60000 408.3333 460.93994
4: GFS -1.3242038 0.1854354 1.4160741 -1.5757881 0.1150746 -0.3000000 -19.65333 493.3333 698.59657
If you want the percentage of Station
s with p < .05, you could do:
sum(mktests$`new P-value` < .05) / nrow(mktests)
In this case, it is zero as none of them were significant based on new P-value
.
The mean of Sen's slope
can be computed:
mean(mktests$`Sen's slope`)
4.45125
I'm not sure if you anticipated different results with your example data (as you suggested the results would be put into 2 columns). Please let me know if this is what you had in mind.
You can try do do something like this, in base R.
First, you can have your data as a list, and each element is one Station
:
data_list <- split(data,data$Station)
Then you can use lapply()
, quoting from the doc:
lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
library(modifiedmk)
stat_list <- lapply(data_list, function(x) mmky1lag(x$Rainfall))
Now, you can put as data.frame
for example, and then calculate what you need.
You can use do.call()
to apply rbind()
to the list, and put it in a data.frame()
. Generally I prefere work with the names of the columns instead of their index, but this is subjective.
From the doc rbind()
:
Take a sequence of vector, matrix or data-frame arguments and combine by columns or rows, respectively. These are generic functions with methods for other R classes.
From the doc do.call()
:
do.call constructs and executes a function call from a name or a function and a list of arguments to be passed to it.
stat_df <- data.frame(do.call(rbind, stat_list))
Now you can easily calculate what you need:
# percentage of the < 0.05 p-values
# here you calculate the number of row of the subset of interest of the
# df / number of row of the dataset.
nrow(stat_df[stat_df$new.P.value < 0.05,])/nrow(stat_df)*100
[1] 0
# Or if you want a prettier result printed:
library(formattable)
percent(nrow(stat_df[stat_df$new.P.value < 0.05,])/nrow(stat_df))
[1] 0.00%
# the mean of Sen.s.slope
mean(stat_df$Sen.s.slope)
[1] 4.45125
Also, I do not get the way you'd like the desired output, it's written Column1 and Column2. If you define it, it's possible to have a result that fits better to your requirements.
Does this come close? The percentage would be zero, as all p-values are bigger than 5%. You'd need to add the < 0.05
in the loop to get a true/false value in the data frame.
results <- data.frame(matrix(NA, 4, 3))
colnames(results) <- c('station', 'p-val', 'Sen-slope')
for(ii in seq_along(unique(Station))){
i <- unique(Station)[ii]
results[ii, 1] <- i
results[ii, 2] <- mmky1lag(as.vector(data$Rainfall[data$Station %in% i]))[2]
results[ii, 3] <- mmky1lag(as.vector(data$Rainfall[data$Station %in% i]))[7]
}
> results
station p-val Sen-slope
1 APT 0.2005025 17.32083
2 AF 0.2140574 29.73750
3 EL 0.4561249 -9.60000
4 GFS 0.1854354 -19.65333
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With