 

Pattern matching in a data frame context

I have a data frame, the first 5 lines of which look as follows:

Sample    CCT6        GAT1                   IMD3          PDR3          RIM15
001       0000000000  111111111111111111111  010001000011  0N100111NNNN  01111111111NNNNNN
002       1111111111  111111111111111111000  000000000000  0N100111NNNN  00000000000000000
003       0NNNN00000  000000000000000000000  010001000011  000000000000  11111111111111111
004       000000NNN0  11100111111N111111111  010001000011  111111111111  01111111111000000
005       0111100000  111111111111111111111  111111111111  0N100111NNNN  00000000000000000

The full data set has 2000 samples. I am trying to write code that tells me whether the string in each of the 5 columns is homogeneous (i.e. all 1s or all 0s) for every sample. Ideally, I'd also like to be able to differentiate between 1 and 0 in the cases where the answer is TRUE. From my example, the expected results would be:

Sample    CCT6        GAT1         IMD3          PDR3          RIM15
001       TRUE (0)    TRUE (1)     FALSE         FALSE         FALSE
002       TRUE (1)    FALSE        TRUE (0)      FALSE         TRUE (0)
003       FALSE       TRUE (0)     FALSE         TRUE (0)      TRUE (1)
004       FALSE       FALSE        FALSE         TRUE (1)      FALSE
005       FALSE       TRUE (1)     TRUE (1)      FALSE         TRUE (0)

I'm not stuck on using logicals; I could use characters as long as they can be used to differentiate between the different classes. Ideally I'd like to return the results in a similar data frame.

I'm having trouble with the most basic first step here, which is to have R tell me whether a string consists entirely of the same value. I've tried various expressions using grep and regexpr, but have been unable to get back a result that I can apply to the entire data frame using ddply or something similar. Here are some examples of what I've tried for this step:

a = as.character("111111111111")
b = as.character("000000000000")
c = as.character("000000011110")


> grep("1",a)
[1] 1

> grep("1",c)
[1] 1

> regexpr("1",a)
[1] 1
attr(,"match.length")
[1] 1
> regexpr("1",c)
[1] 8
attr(,"match.length")
[1] 1

I'd greatly appreciate any help to get me started with this problem or to accomplish my larger goal.

Sam Globus asked Oct 27 '11 03:10


2 Answers

Here is a regular expression that matches a string of one or more characters consisting entirely of zeros or entirely of ones:

(^[0]+$)|(^[1]+$)

The following will all match: 0000 0 111111 11 1

This will not match: 000001
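
As a minimal sketch of how this pattern might be used in R, grepl() returns one logical per string, which is easier to work with than the positions returned by grep() or regexpr(). The vectors a, b and c are the test strings from the question; the name pattern is only an illustrative choice.

```r
# Test strings from the question
a <- "111111111111"
b <- "000000000000"
c <- "000000011110"

# TRUE only when the whole string is all 0s or all 1s
pattern <- "(^[0]+$)|(^[1]+$)"
homogeneous <- grepl(pattern, c(a, b, c))
homogeneous
# [1]  TRUE  TRUE FALSE
```

Because grepl() is vectorised, the same call can be applied to a whole data frame column at once.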

Dan answered Nov 01 '22 20:11


Here's a complete solution. Probably overkill, but also kind of fun.

The key bit is the markTRUE function. It uses a backreference (\\1) to refer to the substring (either 0 or 1) that was previously matched by the first parenthesized subexpression.

The regular expression "^(0|1)(\\1)+$" says 'match any string that begins with either 0 or 1, and is then followed (until the end of the string) by 1 or more repetitions of the same character --- whatever it was'. Later in the same call to gsub(), I use the same backreference to substitute either "TRUE (0)" or "TRUE (1)", as appropriate.
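
To see the backreference at work in isolation, here is a small sketch on a character vector. The vector x and the result name marked are made up for illustration; the pattern and replacement are the ones described above. Strings that do not match are returned unchanged.

```r
# Hypothetical test vector: two homogeneous strings, two mixed ones
x <- c("111111111111", "000000000000", "000000011110", "0N100111NNNN")

# \\1 in the pattern repeats the first character;
# \\1 in the replacement inserts that same character
marked <- sub("^(0|1)(\\1)+$", "TRUE (\\1)", x)
marked
# [1] "TRUE (1)"     "TRUE (0)"     "000000011110" "0N100111NNNN"
```

Note that the `+` requires at least one repetition, so a single-character string such as "0" would not match this pattern.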

First read in the data:

# colClasses = "character" keeps all-digit columns (e.g. IMD3) as strings;
# otherwise read.table() converts them to numbers and drops the leading zeros,
# so "000000000000" would become 0 and no longer match the pattern.
dat <- 
read.table(textConnection("
Sample     CCT6        GAT1                   IMD3           PDR3          RIM15
001       0000000000  111111111111111111111  010001000011  0N100111NNNN  01111111111NNNNNN
002       1111111111  111111111111111111000  000000000000  0N100111NNNN  00000000000000000
003       0NNNN00000  000000000000000000000  010001000011  000000000000  11111111111111111
004       000000NNN0  11100111111N111111111  010001000011  111111111111  01111111111000000
005       0111100000  111111111111111111111  111111111111  0N100111NNNN  00000000000000000"),
header = TRUE, colClasses = "character")

Then unleash the regular expressions:

# Replace homogeneous strings with "TRUE (0)" or "TRUE (1)"
markTRUE <- function(X) {
    gsub(X, pattern = "^(0|1)(\\1)+$", 
         replacement = "TRUE (\\1)")
}

# Everything not marked TRUE in the previous step becomes "FALSE"
markFALSE <- function(X) {
    X[!grepl("TRUE", X)]  <- "FALSE"
    return(X)
}

# Apply to every column except Sample
dat[-1] <- lapply(dat[-1], markTRUE)
dat[-1] <- lapply(dat[-1], markFALSE)

dat
#   Sample     CCT6     GAT1     IMD3     PDR3    RIM15
# 1    001 TRUE (0) TRUE (1)    FALSE    FALSE    FALSE
# 2    002 TRUE (1)    FALSE TRUE (0)    FALSE TRUE (0)
# 3    003    FALSE TRUE (0)    FALSE TRUE (0) TRUE (1)
# 4    004    FALSE    FALSE    FALSE TRUE (1)    FALSE
# 5    005    FALSE TRUE (1) TRUE (1)    FALSE TRUE (0)
Josh O'Brien answered Nov 01 '22 21:11