
How do I make a function in R to check for data errors?

I have many CSV files of temperature data that I am importing into R for processing. The files look like this:

ID   Date.Time          temp1    temp2
1    08/13/17 14:48:18  15.581  -0.423
2    08/13/17 16:48:18  17.510  -0.423
3    08/13/17 18:48:18  15.390  -0.423

Sometimes the temperature readings in columns 3 and 4 are clearly wrong and have to be replaced with NA values. I know that anything over 50 or under -50 is an error. I'd like to just remove these right away. Using

df[, c(3, 4)] <- replace(df[, c(3, 4)], df[, c(3, 4)] > 50, NA)
df[, c(3, 4)] <- replace(df[, c(3, 4)], df[, c(3, 4)] < -50, NA)

works but I don't really want to have to repeat this for every file because it seems messy.

I would like to make a function to replace all this like:

df<-remove.errors(df[,c(3,4)])

I've tried:

remove.errors<-function (df) {
  df[,]<- replace(df[,], df[,] > 50, NA)
  df[,]<- replace(df[,], df[,] < -50, NA)
  }

df<-remove.errors(df[,c(3,4)])

This runs, but the result contains only the 3rd and 4th columns; the first two disappear. I've played around with this code for far too long and tried some other things that didn't work at all.

I know I'm probably missing something basic. Does anyone have tips on writing a function that replaces values in columns 3 and 4 with NA without changing the first two columns?

user97878 asked Jan 18 '26 03:01



1 Answer

1) Try this. It uses only base R.

clean <- function(x, max = 50, min = -max) replace(x, x > max | x < min, NA)
df[3:4] <- clean(df[3:4])
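The version in the question lost the first two columns because the function was handed only df[,c(3,4)] and its result was then assigned to the whole of df. Assigning into df[3:4] instead touches just those two columns. A minimal sketch of that, assuming a made-up sample in the question's layout (the out-of-range values 99.9 and -75 are invented for illustration):

```r
# Same helper as above: values outside [min, max] become NA.
clean <- function(x, max = 50, min = -max) replace(x, x > max | x < min, NA)

# Invented sample data; 99.9 and -75 stand in for bad sensor readings.
df <- data.frame(
  ID        = 1:3,
  Date.Time = c("08/13/17 14:48:18", "08/13/17 16:48:18", "08/13/17 18:48:18"),
  temp1     = c(15.581, 99.9, 15.390),
  temp2     = c(-0.423, -0.423, -75)
)

# Assign into the two temperature columns only;
# ID and Date.Time are left untouched.
df[3:4] <- clean(df[3:4])
print(df)
```

After this, df still has all four columns, and only the two invented bad readings are NA.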

1a) Alternatively we could do this, which returns a modified copy and does not overwrite df:

transform(df, temp1 = clean(temp1), temp2 = clean(temp2))
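Since the question mentions many CSV files, the same non-overwriting idea can be wrapped so that it takes and returns the whole data frame. This is only a sketch: remove_errors, its cols argument, and the two temporary files are hypothetical names and data chosen for illustration, not anything from the question.

```r
clean <- function(x, max = 50, min = -max) replace(x, x > max | x < min, NA)

# Hypothetical wrapper: takes the whole data frame plus the columns to check
# and returns the whole data frame, so no columns are dropped.
remove_errors <- function(df, cols = 3:4, max = 50, min = -max) {
  df[cols] <- clean(df[cols], max = max, min = min)
  df
}

# Two throwaway CSV files stand in for the real data files.
paths <- file.path(tempdir(), c("a.csv", "b.csv"))
write.csv(data.frame(ID = 1:2,
                     Date.Time = c("08/13/17 14:48:18", "08/13/17 16:48:18"),
                     temp1 = c(15.581, 120), temp2 = c(-0.423, -0.423)),
          paths[1], row.names = FALSE)
write.csv(data.frame(ID = 1:2,
                     Date.Time = c("08/13/17 18:48:18", "08/13/17 20:48:18"),
                     temp1 = c(17.510, 15.390), temp2 = c(-60, -0.423)),
          paths[2], row.names = FALSE)

# Read each file and clean it in one pass; the result is a list of data frames.
cleaned <- lapply(paths, function(p) remove_errors(read.csv(p)))
```

Each element of cleaned keeps ID and Date.Time, with the invented readings 120 and -60 replaced by NA.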

2) With magrittr, the compound assignment pipe %<>% updates the columns in place:

library(magrittr)
df[3:4] %<>% { clean(.) }

3) In dplyr we could do this:

library(dplyr)

df %>% mutate_at(3:4, clean)
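
In dplyr 1.0.0 and later, mutate_at is superseded by across(), so the same step can also be written as below. This is a sketch that assumes a recent dplyr is installed (hence the requireNamespace guard) and uses invented sample data; note that, like mutate_at above, it returns a new data frame that has to be assigned to be kept.

```r
clean <- function(x, max = 50, min = -max) replace(x, x > max | x < min, NA)

# Invented sample data; 99.9 and -75 stand in for bad sensor readings.
df <- data.frame(
  ID        = 1:3,
  Date.Time = c("08/13/17 14:48:18", "08/13/17 16:48:18", "08/13/17 18:48:18"),
  temp1     = c(15.581, 99.9, 15.390),
  temp2     = c(-0.423, -0.423, -75)
)

# Only run the dplyr variant if the package is available.
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  # across(3:4, clean) applies clean() to the 3rd and 4th columns by position.
  out <- df %>% mutate(across(3:4, clean))
  print(out)
}
```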

G. Grothendieck answered Jan 20 '26 15:01
