Count comma separated unique values in a string

The first two columns of my data frame form a composite key, and a third column of type character contains comma-separated integers. I want to add a column holding the number of unique integers in each string. I know I can split the string into columns with str_split_fixed and then count the unique values, but because the strings are long this adds a large number of columns and everything slows down. Is there another way? The actual data set has 500k rows and 53 columns. Sample data set:
df

c1      c2    c3  
aa      11   1,13,4,5,4,7,9    
bb      22   2,5,2,4,5,7,11,     
cc      33   11,14,3,1,    
dd      44   1,1,2,4,5,6,15,    
ee      55   4,3,3,1,14,17,

Desired output:

c1 | c2 | c3              | c4
-- | -- | --------------- | --
aa | 11 | 1,13,4,5,4,7,9  | 6
bb | 22 | 2,5,2,4,5,7,11, | 5
cc | 33 | 11,14,3,1,      | 4
dd | 44 | 1,1,2,4,5,6,15, | 6
ee | 55 | 4,3,3,1,14,17,  | 5

Any help would be appreciated!

asked May 12 '17 06:05

Shubhangi Sharma

1 Answer

Using strsplit with uniqueN from the data.table-package:

library(data.table)
df$c4 <- sapply(strsplit(df$c3, ','), uniqueN)

which gives:

> df
  c1 c2              c3 c4
1 aa 11  1,13,4,5,4,7,9  6
2 bb 22 2,5,2,4,5,7,11,  5
3 cc 33      11,14,3,1,  4
4 dd 44 1,1,2,4,5,6,15,  6
5 ee 55  4,3,3,1,14,17,  5

NOTE: if df$c3 is a factor-variable, wrap it in as.character: sapply(strsplit(as.character(df$c3), ','), uniqueN)
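
For readers who want to try this without installing data.table, here is a minimal, self-contained sketch of the same split-and-count idea. It rebuilds the sample data from the question and swaps uniqueN for length(unique(x)); the intermediate name `parts` is only illustrative and is not part of the original answer.

```r
# Sample data from the question; stringsAsFactors = FALSE keeps c3 as character
df <- data.frame(
  c1 = c("aa", "bb", "cc", "dd", "ee"),
  c2 = c(11, 22, 33, 44, 55),
  c3 = c("1,13,4,5,4,7,9", "2,5,2,4,5,7,11,", "11,14,3,1,",
         "1,1,2,4,5,6,15,", "4,3,3,1,14,17,"),
  stringsAsFactors = FALSE
)

# fixed = TRUE matches the comma literally instead of as a regular expression.
# strsplit() does not produce an empty element for a trailing comma.
parts <- strsplit(df$c3, ",", fixed = TRUE)

# vapply() declares the result type, so each row must yield exactly one integer
df$c4 <- vapply(parts, function(x) length(unique(x)), integer(1))

df$c4
# 6 5 4 6 5
```

Because the result is a single vector, no extra columns are created along the way, which is what made the str_split_fixed approach slow on wide strings.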

A base R alternative for creating df$c4:

df$c4 <- sapply(regmatches(df$c3, gregexpr('\\d+', df$c3)), function(x) length(unique(x)))
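
The two approaches agree on the sample data but can differ on messier strings. The sketch below uses a hypothetical input (not from the question) containing a space and an empty field to show the difference: splitting on commas keeps " 1" and "" as distinct tokens, while extracting runs of digits ignores them.

```r
# Hypothetical messy input: one padded value and one empty field
messy <- "1, 1,2,,3"

# Split on literal commas: tokens are "1", " 1", "2", "", "3"
by_split <- strsplit(messy, ",", fixed = TRUE)[[1]]

# Extract runs of digits only: tokens are "1", "1", "2", "3"
by_digits <- regmatches(messy, gregexpr("\\d+", messy))[[1]]

length(unique(by_split))   # 5
length(unique(by_digits))  # 3
```

Note that the digit pattern would also drop a minus sign, so it is only a safe choice when the values are non-negative integers, as they are in the sample data.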

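A tidyverse alternative:

library(dplyr)
library(tidyr)
df %>% 
  separate_rows(c3) %>%                                  # one row per value
  filter(c3 != '') %>%                                   # drop empty strings
  group_by(c1, c2) %>%                                   # c1 and c2 form the composite key
  summarise(c4 = n_distinct(c3), .groups = 'drop') %>% 
  left_join(df, ., by = c('c1', 'c2'))                   # attach c4 to the original rows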

answered Oct 25 '22 15:10

Jaap

Tags: string, r, unique
