Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R Programming using "dplyr" to select rows and return the index of the rows found

Tags:

r

dplyr

Setup/Problem:

Using dplyr - I cannot determine the optimum way to return the row index of a filtered row as opposed to returning the content of the filtered row.

Issue:

I can use dplyr::filter() to extract the row from the dataframe... the issue is that want to extract the index value of the filtered row and add it to a list of index entries that meet the search criteria.

Question:

Is there a simple way to search a dataframe using dplyr against specific criteria and return the numeric index of each row found? The code below uses r::which() to extract the index rows to a list...

    requiredPackages <- c("dplyr")

    ipak <- function(pkg){
            new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
            if (length(new.pkg))
                    install.packages(new.pkg, dependencies = TRUE)
            sapply(pkg, require, character.only = TRUE)
    }

    ipak(requiredPackages)

    if (!file.exists("./week3/data")) {
            dir.create("./week3/data")
    }

    # CSV Download
    if (!file.exists("./week3/data/americancommunitySurvey.csv")) {
            fileUrl <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv?accessType=DOWNLOAD"
            download.file(fileUrl, destfile = "./week3/data/americancommunitySurvey.csv", method = "curl")
    }

    housingData <- tbl_df(read.csv("./week3/data/americancommunitySurvey.csv"
                                   , stringsAsFactors = TRUE))

 Now we have to extract the relevant data
#
# Create a logical vector that identifies the households on greater than 10
# acres who sold more than $10,000 worth of agriculture products. Assign that
# logical vector to the variable agricultureLogical. Apply the which() function
# like this to identify the rows of the data frame where the logical vector is
# TRUE. which(agricultureLogical) What are the first 3 values that result?
#
# ACR 1
# Lot size
# b .N/A (GQ/not a one-family house or mobile home)
# 1 .House on less than one acre
# 2 .House on one to less than ten acres
# 3 .House on ten or more acres                 ACR == 3
#
# AGS 1
# Sales of Agriculture Products
# b .N/A (less than 1 acre/GQ/vacant/
#                 .2 or more units in structure)
# 1 .None
# 2 .$ 1 - $ 999
# 3 .$ 1000 - $ 2499
# 4 .$ 2500 - $ 4999
# 5 .$ 5000 - $ 9999
# 6 .$10000+                                    AGS == 6
#
# Thus, we need to select only the results that have a ACR == 3 AND a AGS == 6
#
agricultureLogical <- which(housingData$ACR == 3 & housingData$AGS == 6)
agricultureLogical
# Now we can display the first three values of the resulting list
head(agricultureLogical[1:3])

The above code gives me the result I want but I want to understand how to do this with dplyr. It is bugging me... I can use dplyr::filter() as follows to extract the rows lines - how do I extract the index of each row found????

agricultureLogical <- filter(housingData, ACR == 3 & housingData$AGS == 6)

R Setup

version _
platform x86_64-apple-darwin13.4.0
arch x86_64
os darwin13.4.0
system x86_64, darwin13.4.0
status
major 3
minor 1.2
year 2014
month 10
day 31
svn rev 66913
language R
version.string R version 3.1.2 (2014-10-31) nickname Pumpkin Helmet

dplyr version 0.3.0.2

Setup Mac OS X

Model Name: MacBook Pro Model Identifier: MacBookPro10,1 Processor Name: Intel Core i7 Processor Speed: 2.7 GHz Number of Processors: 1 Total Number of Cores: 4 L2 Cache (per Core): 256 KB L3 Cache: 8 MB Memory: 16 GB

like image 880
Technophobe01 Avatar asked Jan 17 '15 22:01

Technophobe01


3 Answers

If you are using dplyr >= 0.4 you can do the following

housingData %>%
  add_rownames() %>%
  filter(ACR == 3 & AGS == 6) %>%
  `[[`("rowname") %>%
  as.numeric() -> agricultureLogical

Though why you would consider this an improvement over

agricultureLogical <- which(housingData$ACR == 3 & housingData$AGS == 6)

escapes me.

like image 140
Ista Avatar answered Nov 20 '22 17:11

Ista


Proposed Solution

Here is an example of what I am trying to do... this is sort of a solution but I do not like it. Thanks for to Richard Scriven for the pointer to 1:n()...

Add an index column to the dataframe manually...

I have still not figured out how to return individual index numbers for each of the rows that match a particular set of criteria...

So I added an index column to the example data frame using dplyr:mutate(). I then use dplyr::filter() on the data-frame to apply a filter against the required criteria. This leaves me with the list of rows I want to play with... including an index to the original data frame... I now use dplyr::select() to extract only the index column of the original dataframe entries for each row that meets the criteria...

h1 <- housingData
# Add an index column to the dataframe h1...
h1 <- mutate(h1, IDX = 1:n())
# Filter the h1 dataframe using the criteria defined...
h1 <- filter(h1, ACR == 3 & housingData$AGS == 6)
# Extract the index 
h1 <- select(h1, IDX)
# Convert to an integer list...
agricultureLogical <- as.integer(as.character(h1$IDX))
head(agricultureLogical[1:3])

the above to me is duplicated effort as the index is implicit in the origional dataframe. Hence my sense is there must be a way to return the set of indexes of the items identified by the filter... Answers appreciated :-)

like image 7
Technophobe01 Avatar answered Nov 20 '22 17:11

Technophobe01


As add_rownames() is deprecated you can use rownames_to_column(). The Ista's solution will be in the following format:

housingData %>%
rownames_to_column() %>%
filter(ACR == 3 & AGS == 6) %>%
`[[`("rowname") %>%
as.numeric() -> agricultureLogical
like image 4
ROBBAT1 Avatar answered Nov 20 '22 18:11

ROBBAT1