I have a data frame in R that contains many duplicate records. I would like to find out how many times each distinct record occurs in it.
For example, I have this data frame:
Fake Name   Fake ID  Fake Status  Fake Program
June        0003     Green        PR1
June        0003     Green        PR1
Television  202      Blue         PR3
Television  202      Green        PR3
Television  202      Green        PR3
CRT         12       Red          PR0
From this, I would like to get something like the following:
Fake Name   Fake ID  Fake Status  Fake Program  COUNT
June        0003     Green        PR1           2
Television  202      Blue         PR3           1
Television  202      Green        PR3           2
CRT         12       Red          PR0           1
Any help would be appreciated. Thank you.
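The following uses duplicated to get the result data.frame and then rle to get the counts. Because rle measures runs of consecutive values, this assumes that identical rows are adjacent and that duplicated and non-duplicated groups alternate, as they do in the example.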
res <- dat[!duplicated(dat), ]
d <- duplicated(dat) | duplicated(dat, fromLast = TRUE)
res$COUNT <- rle(d)$lengths
res
# Fake Name Fake ID Fake Status Fake Program COUNT
#1 June 0003 Green PR1 2
#3 Television 202 Blue PR3 1
#4 Television 202 Green PR3 2
#6 CRT 12 Red PR0 1
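The rle step depends on row order, so it may help to see a variant that does not. The following is a minimal sketch in base R using aggregate; the data frame dat and its dotted column names (Fake.Name and so on) are assumptions reconstructed from the question's example, not the asker's actual data.

```r
# Example data rebuilt from the question (assumed; column names use dots).
dat <- data.frame(
  Fake.Name    = c("June", "June", "Television", "Television", "Television", "CRT"),
  Fake.ID      = c("0003", "0003", "202", "202", "202", "12"),
  Fake.Status  = c("Green", "Green", "Blue", "Green", "Green", "Red"),
  Fake.Program = c("PR1", "PR1", "PR3", "PR3", "PR3", "PR0"),
  stringsAsFactors = FALSE
)

# Add a helper column of ones, then sum it within every combination of the
# remaining columns. The result does not depend on the order of the rows.
res <- aggregate(COUNT ~ ., data = cbind(dat, COUNT = 1L), FUN = sum)
res
```

Note that aggregate returns the groups sorted by the grouping columns rather than in their original order of appearance.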
Use group_by_all from dplyr, then count the number of rows in each group with n:
library(dplyr)
df %>% group_by_all() %>% summarise(COUNT = n())
# A tibble: 4 x 5
# Groups: Fake.Name, Fake.ID, Fake.Status [?]
# Fake.Name Fake.ID Fake.Status Fake.Program COUNT
# <fct> <int> <fct> <fct> <int>
#1 CRT 12 Red PR0 1
#2 June 3 Green PR1 2
#3 Television 202 Blue PR3 1
#4 Television 202 Green PR3 2
Or, even better, as suggested in @Ryan's comment:
df %>% group_by_all %>% count
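In dplyr 1.0.0 and later, scoped verbs such as group_by_all are superseded in favour of across(). A hedged sketch of the same count written that way follows; the data frame df is an assumption rebuilt from the question's example, and name = "COUNT" is used only to match the column name the question asks for.

```r
library(dplyr)

# Example data rebuilt from the question (assumed; column names use dots).
df <- data.frame(
  Fake.Name    = c("June", "June", "Television", "Television", "Television", "CRT"),
  Fake.ID      = c("0003", "0003", "202", "202", "202", "12"),
  Fake.Status  = c("Green", "Green", "Blue", "Green", "Green", "Red"),
  Fake.Program = c("PR1", "PR1", "PR3", "PR3", "PR3", "PR0"),
  stringsAsFactors = FALSE
)

# Count the occurrences of each distinct row, grouping on every column.
res <- df %>% count(across(everything()), name = "COUNT")
res
```

Unlike the group_by_all pipelines, this returns an ungrouped result, so no ungroup() call is needed afterwards.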
In base R, the table function provides tabular multi-way counts of every factor combination in your data frame. The result can then be converted to a data frame that matches your original structure, with an added "Freq" column containing the counts.
data.frame(table(df))
# Fake.Name Fake.ID Fake.Status Fake.Program Freq
#1 CRT 0003 Blue PR0 0
#2 June 0003 Blue PR0 0
#3 Television 0003 Blue PR0 0
#4 CRT 12 Blue PR0 0
# ... (remaining rows omitted)
Of course, every combination might not be needed, so you can restrict it to the rows with positive counts:
subset(data.frame(table(df)), Freq > 0)
# Fake.Name Fake.ID Fake.Status Fake.Program Freq
#22 CRT 12 Red PR0 1
#38 June 0003 Green PR1 2
#63 Television 202 Blue PR3 1
#72 Television 202 Green PR3 2
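To make the difference between the full table and the filtered one concrete, here is a small sketch that compares their sizes and renames Freq to COUNT to match the question. The data frame df is an assumption rebuilt from the question's example.

```r
# Example data rebuilt from the question (assumed; column names use dots).
df <- data.frame(
  Fake.Name    = c("June", "June", "Television", "Television", "Television", "CRT"),
  Fake.ID      = c("0003", "0003", "202", "202", "202", "12"),
  Fake.Status  = c("Green", "Green", "Blue", "Green", "Green", "Red"),
  Fake.Program = c("PR1", "PR1", "PR3", "PR3", "PR3", "PR0"),
  stringsAsFactors = FALSE
)

# One row per combination of levels: 3 names x 3 IDs x 3 statuses x 3 programs.
full <- as.data.frame(table(df))
nrow(full)

# Keep only the combinations that actually occur in the data.
nonzero <- full[full$Freq > 0, ]

# Rename the count column to match the output requested in the question.
names(nonzero)[names(nonzero) == "Freq"] <- "COUNT"
nonzero
```

Because the full table grows with the product of the number of levels in each column, this approach can become expensive for data frames with many columns or many distinct values.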
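If you only need the number of distinct values in a single column, rather than a count for each duplicated row, you could use n_distinct from dplyr: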
n_distinct(data$col)
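As a small illustration of how this differs from a per-value count, the sketch below applies n_distinct and table to one column. The vector fake_name is a hypothetical stand-in for the Fake Name column of the question's example.

```r
library(dplyr)

# Hypothetical stand-in for the "Fake Name" column from the question.
fake_name <- c("June", "June", "Television", "Television", "Television", "CRT")

# A single number: how many different values the column contains.
distinct_values <- n_distinct(fake_name)

# One count per value: how often each value occurs.
per_value <- table(fake_name)

distinct_values
per_value
```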