I have a data frame, the first 5 lines of which look as follows:
Sample CCT6 GAT1 IMD3 PDR3 RIM15
001 0000000000 111111111111111111111 010001000011 0N100111NNNN 01111111111NNNNNN
002 1111111111 111111111111111111000 000000000000 0N100111NNNN 00000000000000000
003 0NNNN00000 000000000000000000000 010001000011 000000000000 11111111111111111
004 000000NNN0 11100111111N111111111 010001000011 111111111111 01111111111000000
005 0111100000 111111111111111111111 111111111111 0N100111NNNN 00000000000000000
The full data set has 2000 samples. I am trying to write code that will tell me whether the string of digits in each of the 5 columns is homogeneous (i.e. all 1s or all 0s) for every sample. Ideally, I'd also like to be able to differentiate between 1 and 0 in the cases where the answer is TRUE. From my example, the expected results would be:
Sample CCT6 GAT1 IMD3 PDR3 RIM15
001 TRUE (0) TRUE (1) FALSE FALSE FALSE
002 TRUE (1) FALSE TRUE (0) FALSE TRUE (0)
003 FALSE TRUE (0) FALSE TRUE (0) TRUE (1)
004 FALSE FALSE FALSE TRUE (1) FALSE
005 FALSE TRUE (1) TRUE (1) FALSE TRUE (0)
I'm not set on using logicals; I could use characters as long as they differentiate between the classes. Ideally I'd like to return the results in a similar data frame.
I'm having trouble with the most basic first step, which is to have R tell me whether a string is composed entirely of the same value. I've tried various expressions using grep and regexpr, but have been unable to get a result that I can apply to the entire data frame using ddply or something similar. Here are some examples of what I've tried for this step:
a = as.character("111111111111")
b = as.character("000000000000")
c = as.character("000000011110")
> grep("1",a)
[1] 1
> grep("1",c)
[1] 1
> regexpr("1",a)
[1] 1
attr(,"match.length")
[1] 1
> regexpr("1",c)
[1] 8
attr(,"match.length")
[1] 1
I'd greatly appreciate any help to get me started with this problem or help me accomplish my larger goal.
Here is a regular expression that matches a string made up entirely of zeros or entirely of ones (one or more characters):
(^[0]+$)|(^[1]+$)
The following will match: 0000 0 111111 11 1
This will not match: 000001
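For the first step asked about in the question, a minimal sketch of using this pattern from R might look like the following. It assumes base R's grepl(), which returns one logical per input string (unlike grep(), which returns the indices of matching elements). The vector name x and the extra string containing N are only illustrative.

```r
# Three strings from the question, plus one that contains an N.
x <- c("111111111111", "000000000000", "000000011110", "0N100111NNNN")

# TRUE where the whole string is a run of 0s or a run of 1s.
homogeneous <- grepl("(^[0]+$)|(^[1]+$)", x)
homogeneous
# [1]  TRUE  TRUE FALSE FALSE
```

This only answers yes or no; telling the all-0 strings from the all-1 strings needs a further step, such as looking at the first character.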
Here's a complete solution. Probably overkill, but also kind of fun.
The key bit is the markTRUE function. It uses a backreference (\\1) to refer to the substring (either 0 or 1) that was previously matched by the first parenthesized subexpression.
The regular expression "^(0|1)(\\1)+$" says 'match any string that begins with either 0 or 1, and is then followed (until the end of the string) by 1 or more repetitions of the same character, whatever it was'. Later in the same call to gsub(), I use the same backreference to substitute either "TRUE (0)" or "TRUE (1)", as appropriate.
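To see the backreference on its own, here is a small sketch run on the three test strings from the question. It uses sub() rather than gsub(), which is equivalent here because the pattern is anchored at both ends; the vector name x is just for illustration.

```r
x <- c("111111111111", "000000000000", "000000011110")

# \\1 in the pattern repeats whatever the first group matched;
# \\1 in the replacement inserts that same character.
labelled <- sub("^(0|1)(\\1)+$", "TRUE (\\1)", x)
labelled
# "TRUE (1)", "TRUE (0)", "000000011110" (non-matching strings come back unchanged)

# The pattern needs at least two characters, so a lone "0" is left alone.
sub("^(0|1)(\\1)+$", "TRUE (\\1)", "0")
```

Because a one-character string does not match, it matters that the columns are read in as text, so that a value like 000000000000 is not collapsed to the number 0.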
First read in the data:
dat <-
    read.table(textConnection("
Sample CCT6 GAT1 IMD3 PDR3 RIM15
001 0000000000 111111111111111111111 010001000011 0N100111NNNN 01111111111NNNNNN
002 1111111111 111111111111111111000 000000000000 0N100111NNNN 00000000000000000
003 0NNNN00000 000000000000000000000 010001000011 000000000000 11111111111111111
004 000000NNN0 11100111111N111111111 010001000011 111111111111 01111111111000000
005 0111100000 111111111111111111111 111111111111 0N100111NNNN 00000000000000000"),
    # Read every column as text. Otherwise an all-digit column such as IMD3
    # is converted to numeric and 000000000000 collapses to 0.
    header = TRUE, colClasses = "character")
Then unleash the regular expressions:
# Replace homogeneous strings with "TRUE (0)" or "TRUE (1)".
markTRUE <- function(X) {
    gsub(X, pattern = "^(0|1)(\\1)+$",
         replacement = "TRUE (\\1)")
}
# Anything not marked TRUE by the previous step becomes "FALSE".
markFALSE <- function(X) {
    X[!grepl("TRUE", X)] <- "FALSE"
    return(X)
}
# Apply both steps to every column except Sample.
dat[-1] <- lapply(dat[-1], markTRUE)
dat[-1] <- lapply(dat[-1], markFALSE)
dat
# Sample CCT6 GAT1 IMD3 PDR3 RIM15
# 1 001 TRUE (0) TRUE (1) FALSE FALSE FALSE
# 2 002 TRUE (1) FALSE TRUE (0) FALSE TRUE (0)
# 3 003 FALSE TRUE (0) FALSE TRUE (0) TRUE (1)
# 4 004 FALSE FALSE FALSE TRUE (1) FALSE
# 5 005 FALSE TRUE (1) TRUE (1) FALSE TRUE (0)
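If the two passes feel like more than is needed, the same labelling can probably be done in one step. The following is only a sketch of that idea, not part of the solution above: classify and toy are hypothetical names, and toy is a two-row stand-in built from values in the question.

```r
# Hypothetical helper: label one column of marker strings in a single pass.
classify <- function(x) {
    x <- as.character(x)
    ifelse(grepl("^(0+|1+)$", x),
           paste0("TRUE (", substr(x, 1, 1), ")"),  # first character says which
           "FALSE")
}

# Small stand-in for the real data (values taken from the question).
toy <- data.frame(
    Sample = c("001", "002"),
    CCT6   = c("0000000000", "1111111111"),
    PDR3   = c("0N100111NNNN", "0N100111NNNN"),
    stringsAsFactors = FALSE
)

# Apply to every column except Sample.
toy[-1] <- lapply(toy[-1], classify)
toy
```

Unlike the backreference pattern, "^(0+|1+)$" also accepts a single-character string, which may or may not be what you want for your data.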