
How to subset efficiently using a loop in R?

Tags: loops, r, subset

I have a CSV file named "table_parameter" (please download it from here). The data look like this:

           time        avg.PM10            sill       range         nugget
    1   2012030101  52.2692307692308    0.11054330  45574.072   0.0372612157
    2   2012030102  55.3142857142857    0.20250974  87306.391   0.0483153769
    3   2012030103  56.0380952380952    0.17711558  56806.827   0.0349567088
    4   2012030104  55.9047619047619    0.16466350  104767.669  0.0307528346
    .
    .
    .
    25  2012030201  67.1047619047619    0.14349774  72755.326   0.0300378129
    26  2012030202  71.6571428571429    0.11373430  72755.326   0.0320594776
    27  2012030203  73.352380952381     0.13893530  72755.326   0.0311135434
    28  2012030204  70.2095238095238    0.12642303  29594.037   0.0281416079
    .
    .

In my data frame there is a variable named time that contains hourly values from 1 March 2012 to 7 March 2012 in numeric form. For example, 1 March 2012, 1:00 a.m. is written as 2012030101, and so on.
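To make the encoding concrete, here is a small sketch of how such a value splits into a date part and an hour part with integer arithmetic. The three values are taken from the table above; the names day_part and hour_part are only for illustration.

```r
# Three values from the time column, in yyyymmddhh layout
time <- c(2012030101, 2012030104, 2012030201)

# Integer division by 100 drops the last two digits, leaving yyyymmdd
day_part <- time %/% 100

# The remainder modulo 100 is the hour of the day
hour_part <- time %% 100

data.frame(time, day_part, hour_part)
```

This gives the same hour as taking characters 9 and 10 of the value with substr(), without going through a string.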

From this dataset I want to create 24*11 data frames, laid out like the table below:

[Image: table of the 24 hours by 11 avg.PM10 intervals]

For example, I want one data frame for 1 a.m. (2012030101, 2012030201, ..., 2012030701) with avg.PM10 <= 10. You have probably noticed that some of these data frames will contain no observations. That is okay, because I will be working with a very large data set.

I can do this subsetting manually by writing 24*11 = 264 lines of code like this:

table_par<-read.csv("table_parameter.csv")
times<-as.numeric(substr(table_par$time,9,10))

par_1am_0to10 <-subset(table_par,times ==1 & avg.PM10<=10)
par_1am_10to20 <-subset(table_par,times ==1 & avg.PM10>10 & avg.PM10<=20)
par_1am_20to30 <-subset(table_par,times ==1 & avg.PM10>20 & avg.PM10<=30)
.
.
.
par_24pm_80to90 <-subset(table_par,times ==24 & avg.PM10>80 & avg.PM10<=90)
par_24pm_90to100 <-subset(table_par,times==24 & avg.PM10>90 & avg.PM10<=100)
par_24pm_100up <-subset(table_par,times  ==24 & avg.PM10>100)

But I understand this code is very inefficient. Is there any way to do it efficiently by using a loop?

FYI: Later on I want to draw some plots from these 24*11 datasets.

Update: After this subsetting, I want to draw boxplots of the range variable for every dataset. The problem is that I want to show all 24*11 boxplots of range in one plot, arranged like a matrix (as in the figure above). If you have any further questions, please let me know. Thanks a lot in advance.

Orpheus asked Feb 01 '26 07:02


1 Answer

You can do this with a combination of plyr, dplyr and tidyr:

library(tidyr)
library(dplyr)
# I am not loading plyr here because it conflicts with dplyr; I only need its round_any function anyway

# Read data
dfData <- read.csv("table_parameter.csv")

dfData %>% 
  # Extract hour and compute the rounded Avg.PM10 using round_any
  # (the question's listing spells the column avg.PM10; R is case-sensitive,
  # so use whichever spelling names(dfData) reports)
  mutate(hour = as.numeric(substr(time, 9, 10)),
         roundedPM.10 = plyr::round_any(Avg.PM10, 10, floor),
         roundedPM.10 = ifelse(roundedPM.10 > 100, 100, roundedPM.10)) %>% 
  # Keep only the relevant columns
  select(hour, roundedPM.10) %>% 
  # Count the number of occurrences per interval and hour
  count(roundedPM.10, hour) %>% 
  # Use spread (from tidyr) to transform it into wide format
  spread(hour, n)

If you plan on using ggplot2, you can drop tidyr and the last line of the code so that the data frame stays in long format. It will be easier to plot that way.
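As a rough sketch of what plotting from the long format could look like, the example below counts observations per hour and interval and draws them as a tile grid. Several things here are assumptions made for illustration: the sample rows are copied (rounded) from the question so the code runs without the CSV, the column is spelled avg.PM10 as in the question's listing, floor() with pmin() stands in for plyr::round_any so that plyr is not required, and geom_tile is just one possible way to display the counts.

```r
library(dplyr)
library(ggplot2)

# A few rows copied (rounded) from the question, for illustration only
dfData <- data.frame(
  time = c(2012030101, 2012030102, 2012030103, 2012030104,
           2012030201, 2012030202, 2012030203, 2012030204),
  avg.PM10 = c(52.27, 55.31, 56.04, 55.90, 67.10, 71.66, 73.35, 70.21)
)

# Long format: one row per (interval, hour) pair with its count n
dfCounts <- dfData %>%
  mutate(hour = as.numeric(substr(time, 9, 10)),
         # floor to the lower multiple of 10, then cap at 100
         roundedPM.10 = pmin(floor(avg.PM10 / 10) * 10, 100)) %>%
  count(roundedPM.10, hour)

# Hours on the x axis, intervals on the y axis, count as fill colour
ggplot(dfCounts, aes(factor(hour), factor(roundedPM.10), fill = n)) +
  geom_tile()
```

Because the counts stay in long format, the same data frame can be passed to other geoms or facets without reshaping it back.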

EDIT: After reading your comment, I realised I had misunderstood your question. This will give you a boxplot for each combination of hour and avg.PM10 interval:

library(tidyr)
library(dplyr)
library(ggplot2)
# I am not loading plyr here because it conflicts with dplyr; I only need its
# round_any function anyway

# Read data
dfData <- read.csv("table_parameter.csv")

dfDataPlot <- dfData %>% 
  # Extract hour and compute the rounded Avg.PM10 using round_any
  # (the question's listing spells the column avg.PM10; R is case-sensitive,
  # so use whichever spelling names(dfData) reports)
  mutate(hour = as.numeric(substr(time, 9, 10)),
         roundedPM.10 = plyr::round_any(Avg.PM10, 10, floor),
         roundedPM.10 = ifelse(roundedPM.10 > 100, 100, roundedPM.10)) %>% 
  # Keep only the relevant columns
  select(roundedPM.10, hour, range)

# Plot range as a function of hour (as a factor to have separate plots)
# and facet it according to roundedPM.10 on the y axis
ggplot(dfDataPlot, aes(factor(hour), range)) + 
  geom_boxplot() + 
  facet_grid(roundedPM.10~.)
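
If you still need the 24*11 data frames themselves, here is a minimal base R sketch using cut() and split() instead of an explicit loop. It is an illustration under a few assumptions: the sample rows are copied (rounded) from the question, the column is spelled avg.PM10 as in the question's listing, hours run from 1 to 24 as in the question's code, and the names table_par, pm_bin and subsets are made up for the example. The bins are closed on the right, like the question's subset() calls, whereas the round_any(..., floor) bins above are closed on the left.

```r
# A few rows copied (rounded) from the question, for illustration only
table_par <- data.frame(
  time = c(2012030101, 2012030102, 2012030201, 2012030202),
  avg.PM10 = c(52.27, 55.31, 67.10, 71.66),
  range = c(45574.072, 87306.391, 72755.326, 72755.326)
)

# Hour of day from the last two digits of time, with all 24 levels declared
table_par$hour <- factor(table_par$time %% 100, levels = 1:24)

# Bin avg.PM10 into [0,10], (10,20], ..., (90,100], (100,Inf]
breaks <- c(seq(0, 100, by = 10), Inf)
labels <- c(paste0(seq(0, 90, by = 10), "to", seq(10, 100, by = 10)), "100up")
table_par$pm_bin <- cut(table_par$avg.PM10, breaks = breaks, labels = labels,
                        include.lowest = TRUE)

# One data frame per hour/bin combination; empty combinations are kept
subsets <- split(table_par, list(table_par$hour, table_par$pm_bin))

length(subsets)        # number of hour/bin combinations
subsets[["1.50to60"]]  # rows for hour 1 with avg.PM10 in (50, 60]
```

The result is a named list, so a later step can loop over it with lapply() or look up a single combination by name instead of keeping hundreds of separate variables.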
Tutuchan answered Feb 02 '26 23:02


