Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Create summary table in R using statistics from package `modifiedmk`

I'm trying to run a function from the modifiedmk package in R.

install.packages('modifiedmk')
library(modifiedmk)

I have a dataframe data which I produced with the following:

Station <- c('APT','APT', 'APT','APT', 'APT', 'APT', 'APT','APT', 'APT','APT','APT','APT',
              'AF','AF', 'AF','AF','AF','AF','AF','AF','AF',
             'EL', 'EL', 'EL', 'EL', 'EL', 'EL', 'EL', 'EL', 'EL', 'EL', 'EL', 'EL', 'EL', 'EL', 'EL',
             'GFS', 'GFS', 'GFS', 'GFS', 'GFS', 'GFS', 'GFS', 'GFS', 'GFS', 'GFS', 'GFS', 'GFS', 'GFS', 'GFS', 'GFS', 'GFS'
              )
Rainfall <- c(375.3, 263.3, 399.2, 242.6, 847.6, 276.5, 712.8, 366.3, 188.6, 478.4, 539, 682.5,
            520.7, 1337.8, 524, 908.4,748.5,411.8, 772.4,978.5,983,
            732.4, 788.6, 567.1, 576, 931.6, 727.2, 1079.3, 902.8,493.4,  630.7, 784.1,660.2, 531.3, 487.1,798.4,
            1064.1,  590.3, 1011.2, 1037.1,  1398.4, 1153.6,994.1,  1100.2,743.7,637.4, 792.2, 891.9,880.9, 670, 920.2,681.4)
Year <- c('1957','1958','1959','1960','1961','1962','1963','1964','1965','1966','1967','1968',
                  '1960','1961','1962','1963','1964','1965','1966','1967','1968',
                  '1957','1958','1959','1960','1961','1962','1963','1964','1965','1966','1967','1968','1969','1970','1971',
                  '1964','1965','1966','1967','1968','1969','1970','1971','1972','1973','1974','1975','1976','1977','1978','1979')
length(Year)
data<-data.frame(Year, Station, Rainfall)

where I have four Stations of rainfall data as rows in the dataframe. I want to apply the mmky1lag method from the modifiedmk package on each Station of data and produce a summary table in R that has two columns:

  1. The percent of stations with significant trends where p < 0.05
  2. The average Sen's slope

For example, I can run the mmky1lag method on all of the Rainfall data using mmky1lag(as.vector(data$Rainfall)) which produces

> mmky1lag(as.vector(data$Rainfall))
Corrected Zc  new P-value         N/N*   Original Z  old P.value 
3.332353e+00 8.611480e-04 1.297360e+00 3.795608e+00 1.472822e-04 
         Tau  Sen's slope old.variance new.variance 
3.634992e-01 9.092857e+00 1.605933e+04 2.083474e+04

And I'm interested in two of those outputs:

Column 1:

# Get percent of stations with significant trends where p < 0.05
mmky1lag(as.vector(data$Rainfall))[2] < 0.05

and Column 2:

# Make another column that is the mean Sen's slope
mmky1lag(as.vector(data$Rainfall))[7] 

However, how do I apply this method on data where I get a result for each individual Station? In python, I would groupby Station and then apply the method. But I'm not sure how to do that in R.

Then after grouping by station, I want a summary table with the two aforementioned columns of information.

like image 487
JAG2024 Avatar asked Nov 04 '20 20:11

JAG2024


People also ask

How do you create a summary statistics table in R?

The easiest way to create summary tables in R is to use the describe() and describeBy() functions from the psych library.

What is a summary statistics table?

The summary table is a visualization that summarizes statistical information about data in table form.

How do I get the summary of a column in R?

Descriptive statistics in R (Method 1): summary statistic is computed using summary() function in R. summary() function is automatically applied to each column. The format of the result depends on the data type of the column. If the column is a numeric variable, mean, median, min, max and quartiles are returned.


3 Answers

If you want to apply the mmky1lag function to a data frame by group (in this case, station) there are multiple approaches to consider.

First, you could use aggregate:

library(modifiedmk)

mktests <- aggregate(Rainfall ~ Station, data = data, FUN = mmky1lag)

This will take a formula using the measure of Rainfall by Station group. All of your results will be returned in a matrix with the MK test parameters in a single column.

Another approach might be with data.table package.

library(data.table)

mktests <- as.data.table(data)[, as.list(mmky1lag(Rainfall)), by = Station]

This will take the results from mmky1lag and put into a list and then converted to a data table. The option by will allow you to perform this by Station.

A third approach might be with dplyr package.

library(dplyr)

mktests <- data %>%
  group_by(Station) %>%
  group_map(~mmky1lag(.x$Rainfall)) %>%
  setNames(unique(sort(data$Station))) %>%
  bind_rows(.id = "Station")

This uses group_by to group by Station, and then group_map which will apply the mmky1lag function to grouped elements. The setNames is needed to add Station value back to the results, and then bind_rows to turn the resultant list into a data frame.

The result (with the data.table solution) should look like this (the other approaches should be similar):

R> mktests
   Station Corrected Zc new P-value      N/N* Original Z old P.value        Tau Sen's slope old.variance new.variance
1:     APT    1.2801214   0.2005025 0.4849366  0.8914431   0.3726915  0.2121212    17.32083     212.6667    103.12986
2:      AF    1.2424858   0.2140574 0.5703144  0.9383149   0.3480826  0.2777778    29.73750      92.0000     52.46892
3:      EL   -0.7452428   0.4561249 1.1288325 -0.7917947   0.4284804 -0.1619048    -9.60000     408.3333    460.93994
4:     GFS   -1.3242038   0.1854354 1.4160741 -1.5757881   0.1150746 -0.3000000   -19.65333     493.3333    698.59657

If you want the percentage of Stations with p < .05, you could do:

sum(mktests$`new P-value` < .05) / nrow(mktests)

In this case, it is zero as none of them were significant based on new P-value.

The mean of Sen's slope can be computed:

mean(mktests$`Sen's slope`)
4.45125

I'm not sure if you anticipated different results with your example data (as you suggested the results would be put into 2 columns). Please let me know if this is what you had in mind.

like image 179
Ben Avatar answered Oct 18 '22 04:10

Ben


You can try do do something like this, in base R.
First, you can have your data as a list, and each element is one Station:

data_list <- split(data,data$Station)

Then you can use lapply(), quoting from the doc:

lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.

library(modifiedmk)
stat_list <- lapply(data_list, function(x) mmky1lag(x$Rainfall))

Now, you can put as data.frame for example, and then calculate what you need. You can use do.call() to apply rbind() to the list, and put it in a data.frame(). Generally I prefere work with the names of the columns instead of their index, but this is subjective.
From the doc rbind():

Take a sequence of vector, matrix or data-frame arguments and combine by columns or rows, respectively. These are generic functions with methods for other R classes.

From the doc do.call():

do.call constructs and executes a function call from a name or a function and a list of arguments to be passed to it.

stat_df <- data.frame(do.call(rbind, stat_list))

Now you can easily calculate what you need:

# percentage of the < 0.05 p-values
# here you calculate the number of row of the subset of interest of the
# df / number of row of the dataset.
nrow(stat_df[stat_df$new.P.value < 0.05,])/nrow(stat_df)*100
[1] 0

# Or if you want a prettier result printed:
library(formattable)
percent(nrow(stat_df[stat_df$new.P.value < 0.05,])/nrow(stat_df))
[1] 0.00%

# the mean of Sen.s.slope
mean(stat_df$Sen.s.slope)
[1] 4.45125

Also, I do not get the way you'd like the desired output, it's written Column1 and Column2. If you define it, it's possible to have a result that fits better to your requirements.

like image 2
s__ Avatar answered Oct 18 '22 04:10

s__


Does this come close? The percentage would be zero, as all p-values are bigger than 5%. You'd need to add the < 0.05 in the loop to get a true/false value in the data frame.

results <- data.frame(matrix(NA, 4, 3))
colnames(results) <- c('station', 'p-val', 'Sen-slope')
for(ii in seq_along(unique(Station))){
  i <- unique(Station)[ii]
  results[ii, 1] <- i
  results[ii, 2] <- mmky1lag(as.vector(data$Rainfall[data$Station %in% i]))[2]
  results[ii, 3] <- mmky1lag(as.vector(data$Rainfall[data$Station %in% i]))[7]
}

> results
  station     p-val Sen-slope
1     APT 0.2005025  17.32083
2      AF 0.2140574  29.73750
3      EL 0.4561249  -9.60000
4     GFS 0.1854354 -19.65333
like image 1
tester Avatar answered Oct 18 '22 04:10

tester