
Filtering data in a dataframe based on criteria

Tags: r, subset

I am new to R and can't get to grips with this concept. Suppose I have a table loaded called "places" with, say, 3 columns: city, population and average summer temperature.

Say I want to "filter" - produce a new table object where population is less than 1 million and average summer temperature is greater than 70 degrees.

In any other program I have used this would be pretty easy but having done some research I'm working myself up into greater confusion. Given the purpose of R and what it does this must be pretty simple stuff.

How would I apply the above conditions to the table? What would the steps be? From what I understand, I cannot easily just select the table headings based on their name, which would be nice (e.g. WHERE population < 1,000,000).

Doug Fir asked Jan 07 '13 22:01


2 Answers

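You are looking for subset()

If your data is called mydata and the population is in a column called population:

newdata <- subset(mydata, population < 1e6)

Or you could use [, which is programmatically safer. Note the comma after the condition: it makes the condition select rows rather than columns

newdata <- mydata[mydata$population < 1e6, ]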

For more than one condition, use & (and) or | (or) where appropriate
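
As a rough sketch of combining conditions, here is the question's filter applied to a small made-up data frame. The name places comes from the question, but the column names (city, population, summer_temp) and all values are assumptions for illustration only.

```r
# Hypothetical sample data; names and values are made up for illustration.
places <- data.frame(
  city        = c("A", "B", "C", "D"),
  population  = c(500000, 2000000, 800000, 300000),
  summer_temp = c(75, 80, 65, 72),
  stringsAsFactors = FALSE
)

# AND: both conditions must hold (population under 1 million, temperature above 70).
both <- subset(places, population < 1e6 & summer_temp > 70)

# The same filter with `[`; the comma after the condition selects rows.
both_bracket <- places[places$population < 1e6 & places$summer_temp > 70, ]

# OR: at least one of the conditions holds.
either <- subset(places, population < 1e6 | summer_temp > 70)

print(both$city)
print(either$city)
```

With this sample data, the AND filter keeps cities A and D, while the OR filter keeps all four rows.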

You could also use the sqldf package to write the filter in SQL

library(sqldf)

newdata <- sqldf('select * from mydata where population < 1e6')

Or you could use data.table, which makes the syntax for [ easier (as well as being memory efficient)

library(data.table)

mydatatable <- data.table(mydata)
newdata <- mydatatable[population < 1e6]
mnel answered Sep 20 '22 13:09


Given a dataframe "dfrm" with the city names in the "city" column, the population in the "population" column and the average summer temperature in the "meanSummerT" column, your request for the subset meeting those joint requirements would be met by any of these:

subset( dfrm, population < 1e6 & meanSummerT > 70)
dfrm[ which(dfrm$population < 1e6 & dfrm$meanSummerT > 70) , ]
dfrm[ which( dfrm[[ 'population' ]] < 1e6 & dfrm[[ 'meanSummerT' ]] > 70) , ]

If you wanted just the names of the cities meeting those joint criteria then these would work:

subset( dfrm, population < 1e6 & meanSummerT > 70 , city)
dfrm[ which(dfrm$population < 1e6 & dfrm$meanSummerT > 70) , "city" ]
dfrm[ which(dfrm[['population']] < 1e6 & dfrm[['meanSummerT']] > 70) , "city" ]
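
One difference between these forms is the type of the result. A minimal sketch, using made-up values in a data frame shaped like dfrm (the data is hypothetical):

```r
# Hypothetical sample data, made up for illustration.
dfrm <- data.frame(
  city        = c("A", "B", "C"),
  population  = c(500000, 2000000, 300000),
  meanSummerT = c(75, 80, 72),
  stringsAsFactors = FALSE
)

keep <- dfrm$population < 1e6 & dfrm$meanSummerT > 70

# subset() with a select argument returns a one-column data frame.
as_df <- subset(dfrm, population < 1e6 & meanSummerT > 70, city)

# `[` with a single column name drops to a plain vector by default.
as_vec <- dfrm[which(keep), "city"]

# drop = FALSE keeps a one-column data frame instead.
as_df2 <- dfrm[which(keep), "city", drop = FALSE]

print(class(as_df))
print(class(as_vec))
print(class(as_df2))
```

Which form is preferable depends on whether the next step expects a vector of names or a data frame.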

Note that the column names are not quoted in subset() or after the "$" operator, but they are quoted inside "[[". Also note that which() itself is safe when no lines of data match (you simply get no lines), but negating it is dangerous: dfrm[ -which(cond) , ] returns no lines at all when nothing matches, instead of the entire dataframe you would expect.
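
A short sketch of how these forms behave at the edges, that is, when a value is missing and when nothing matches. The data is made up, and the variable names (cond, none, n_*) are only for illustration.

```r
# Hypothetical sample data with one missing population value.
dfrm <- data.frame(
  city        = c("A", "B", "C"),
  population  = c(500000, NA, 300000),
  meanSummerT = c(75, 80, 72),
  stringsAsFactors = FALSE
)

cond <- dfrm$population < 1e6 & dfrm$meanSummerT > 70  # TRUE, NA, TRUE

# A bare logical index keeps the NA position as an all-NA row.
n_logical <- nrow(dfrm[cond, ])

# which() and subset() both discard rows where the condition is NA.
n_which  <- nrow(dfrm[which(cond), ])
n_subset <- nrow(subset(dfrm, population < 1e6 & meanSummerT > 70))

# A condition that matches nothing.
none  <- dfrm$meanSummerT > 100        # all FALSE
n_pos <- nrow(dfrm[which(none), ])     # no rows
n_neg <- nrow(dfrm[-which(none), ])    # also no rows, although every row "should" remain
n_not <- nrow(dfrm[!none, ])           # logical negation keeps every row

print(c(n_logical, n_which, n_subset, n_pos, n_neg, n_not))
```

For excluding rows, logical negation with ! avoids the empty-index pitfall of -which().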

IRTFM answered Sep 19 '22 13:09