I have a df with dimension 58000*900 which contains replicates in row values, I want to traverse through every row and remove them. An example will make it more clear.
df
IDs Name col1 col2 col3
123 AB.C 1.3,1.3,1.3,1.3,1.3 0,0,0,0,0 5,5,5,5,5
234 CD-E 2,2,2,2,2 0.3,0.3,0.3,0.3,0.3 1,1,1,1,1
568 GHJ 123456 123456 123456
345 FGH 9,9,9,9,9 54,54,54,54,54 0,0,0,0,0
Apparently every value is replicated 5 times and in some cases their is a problem that there is no . or , separating the values.
What I want is drop those lines which does not contain either . or , and for the rest remove the duplicate values. So, the output will be:
IDs Name col1 col2 col3
123 AB.C 1.3 0 5
234 CD-E 2 0.3 1
345 FGH 9 54 0
dput(df)
structure(list(IDs = c(123L, 234L, 568L, 345L), Name = structure(c(1L,
2L, 4L, 3L), .Label = c("ABC", "CDE", "FGH", "GHJ"), class = "factor"),
col1 = structure(c(2L, 3L, 1L, 4L), .Label = c("123456",
"1.3,1.3,1.3,1.3,1.3", "2,2,2,2,2", "9,9,9,9,9"), class = "factor"),
col2 = structure(1:4, .Label = c("0,0,0,0,0", "0.3,0.3,0.3,0.3,0.3",
"123456", "54,54,54,54,54"), class = "factor"), col3 = structure(c(4L,
2L, 3L, 1L), .Label = c("0,0,0,0,0", "1,1,1,1,1", "123456",
"5,5,5,5,5"), class = "factor")), .Names = c("IDs", "Name",
"col1", "col2", "col3"), class = "data.frame", row.names = c(NA,
-4L))
First, we restructure your data in a long format using gather(), then we filter() for value with no , using grepl(). We then split the string in value into a list using strsplit() and make each element of the list it's own row using unnest(). We remove duplicated rows using distinct() and spread() back the key and values to columns.
library(dplyr)
library(tidyr)
df %>%
gather(key, value, -(IDs:Name)) %>%
filter(grepl(",", value)) %>%
mutate(value = strsplit(value, ",")) %>%
unnest(value) %>%
distinct %>%
spread(key, value)
Which gives:
#Source: local data frame [3 x 5]
#
# IDs Name col1 col2 col3
# (int) (fctr) (chr) (chr) (chr)
#1 123 AB.C 1.3 0 5
#2 234 CD-E 2 0.3 1
#3 345 FGH 9 54 0
Another idea would be to use cSplit from splitstackshape:
df %>%
cSplit(., c("col1", "col2", "col3"), direction = "long", sep = ",") %>%
group_by(Name) %>%
filter(!any(is.na(.))) %>%
distinct
Which gives:
#Source: local data table [3 x 5]
#Groups: Name
#
# IDs Name col1 col2 col3
# (int) (fctr) (dbl) (dbl) (int)
#1 123 AB.C 1.3 0.0 5
#2 234 CD-E 2.0 0.3 1
#3 345 FGH 9.0 54.0 0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With