If I try to get an average of c(NA, NA, 3, 4, 5, 6, 7, 8, 9, 10)
using AVG
from SQL, I get a value of 5.2, instead of the expected 6.5.
# prepare data and write to file
write.table(data.frame(col1 = c(NA, NA, 3:10)),
"my.na.txt", row.names = FALSE)
mean(c(NA, NA, 3:10), na.rm = TRUE) # 6.5
my.na <- read.csv.sql("my.na.txt", sep = " ",
sql = "SELECT AVG(col1) FROM file") # 5.2
# this is identical to
sum(3:10)/10
unlink("my.na.txt") # remove file
Which leads me to believe that sql(df) treats NA values as zero. Is it possible to ignore (exclude) NA values in an SQL call as it can be done using na.rm
argument (in R)?
Modify your query to ignore the NA
values:
SELECT AVG(col1)
FROM file
WHERE col1 IS NOT \"NA\"
The problem is that the read.csv.sql
function does not recognize the missing values, and converts them to zero, instead of NULL.
This does not happen if you first load the data into a data.frame, and only then call sqldf
.
d <- read.csv("my.na.txt")
sqldf("SELECT AVG(col1) FROM d") # 6.5
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With