Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Count number of rows matching a criteria

Tags:

r

I am looking for a command in R which is equivalent of this SQL statement. I want this to be a very simple basic solution without using complex functions OR dplyr type of packages.

Select count(*) as number_of_states 
  from myTable
where  sCode = "CA"

so essentially I would be counting number of rows matching my where condition.

I have imported a csv file into mydata as a data frame.So far I have tried these with no avail.

  1. nrow(mydata$sCode == "CA") ## ==>> returns NULL

  2. sum(mydata[mydata$sCode == 'CA',], na.rm=T) ## ==>> gives Error in FUN(X[[1L]], ...) : only defined on a data frame with all numeric variables

  3. sum(subset(mydata, sCode='CA', select=c(sCode)), na.rm=T) ## ==>> FUN(X[[1L]], ...) : only defined on a data frame with all numeric variables

  4. sum(mydata$sCode == "CA", na.rm=T) ## ==>> returns count of all rows in the entire data set, which is not the correct result.

and some variations of the above samples. Any help would be appreciated! Thanks.

like image 870
multi-sam Avatar asked Jan 28 '15 15:01

multi-sam


People also ask

How do I count cells in Excel based on conditions?

COUNTIF is an Excel function to count cells in a range that meet a single condition. COUNTIF can be used to count cells that contain dates, numbers, and text. The criteria used in COUNTIF supports logical operators (>,<,<>,=) and wildcards (*,?) for partial matching.


3 Answers

mydata$sCode == "CA" will return a boolean array, with a TRUE value everywhere that the condition is met. To illustrate:

> mydata = data.frame(sCode = c("CA", "CA", "AC"))
> mydata$sCode == "CA"
[1]  TRUE  TRUE FALSE

There are a couple of ways to deal with this:

  1. sum(mydata$sCode == "CA"), as suggested in the comments; because TRUE is interpreted as 1 and FALSE as 0, this should return the numer of TRUE values in your vector.

  2. length(which(mydata$sCode == "CA")); the which() function returns a vector of the indices where the condition is met, the length of which is the count of "CA".

Edit to expand upon what's happening in #2:

> which(mydata$sCode == "CA")
[1] 1 2

which() returns a vector identify each column where the condition is met (in this case, columns 1 and 2 of the dataframe). The length() of this vector is the number of occurences.

like image 96
Joe Avatar answered Oct 22 '22 12:10

Joe


sum is used to add elements; nrow is used to count the number of rows in a rectangular array (typically a matrix or data.frame); length is used to count the number of elements in a vector. You need to apply these functions correctly.

Let's assume your data is a data frame named "dat". Correct solutions:

nrow(dat[dat$sCode == "CA",])
length(dat$sCode[dat$sCode == "CA"])
sum(dat$sCode == "CA")
like image 32
Alex W Avatar answered Oct 22 '22 10:10

Alex W


  1. mydata$sCode is a vector, it's why nrow output is NULL.
  2. mydata[mydata$sCode == 'CA',] returns data.frame where sCode == 'CA'. sCode includes character. That's why sum gives you the error.
  3. subset(mydata, sCode='CA', select=c(sCode)), you should use sCode=='CA' instead sCode='CA'. Then subset returns you vector where sCode equals CA, so you should use

    length(subset(na.omit(mydata), sCode='CA', select=c(sCode)))

Or you can try this: sum(na.omit(mydata$sCode) == "CA")

like image 7
Fedorenko Kristina Avatar answered Oct 22 '22 12:10

Fedorenko Kristina