
Count comma separated unique values in a string

Tags: string, r, unique

The first two columns of my data frame form a composite key, and a third column of type character contains comma-separated integers. My objective is to add a column containing the count of unique integers in each string. I know I can split the string into columns with str_split_fixed and then count the unique values, but because the strings are long this adds a large number of columns and everything lags. Is there another method? The actual data set has 500k rows and 53 columns. Sample dataset:
df

c1      c2    c3  
aa      11   1,13,4,5,4,7,9    
bb      22   2,5,2,4,5,7,11,     
cc      33   11,14,3,1,    
dd      44   1,1,2,4,5,6,15,    
ee      55   4,3,3,1,14,17,
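
To make the sample reproducible, it could be built roughly as follows. This is a sketch: stringsAsFactors = FALSE is an assumption, chosen so that c3 stays a character column rather than a factor.

```r
# Sample data from the table above; c1 and c2 form the composite key
df <- data.frame(
  c1 = c("aa", "bb", "cc", "dd", "ee"),
  c2 = c(11, 22, 33, 44, 55),
  c3 = c("1,13,4,5,4,7,9", "2,5,2,4,5,7,11,", "11,14,3,1,",
         "1,1,2,4,5,6,15,", "4,3,3,1,14,17,"),
  stringsAsFactors = FALSE  # assumption: keep c3 as character
)
```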

Desired output:

c1      c2    c3                 c4
aa      11    1,13,4,5,4,7,9     6
bb      22    2,5,2,4,5,7,11,    5
cc      33    11,14,3,1,         4
dd      44    1,1,2,4,5,6,15,    6
ee      55    4,3,3,1,14,17,     5

Any help would be appreciated!

Shubhangi Sharma asked May 12 '17 06:05




1 Answer

Using strsplit with uniqueN from the data.table-package:

library(data.table)  # provides uniqueN
df$c4 <- sapply(strsplit(df$c3, ','), uniqueN)

which gives:

> df
  c1 c2              c3 c4
1 aa 11  1,13,4,5,4,7,9  6
2 bb 22 2,5,2,4,5,7,11,  5
3 cc 33      11,14,3,1,  4
4 dd 44 1,1,2,4,5,6,15,  6
5 ee 55  4,3,3,1,14,17,  5

NOTE: if df$c3 is a factor, wrap it in as.character first: sapply(strsplit(as.character(df$c3), ','), uniqueN)
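
If loading data.table only for uniqueN is not wanted, the same split-and-count idea can be sketched in base R. This is a minimal sketch, not a benchmarked solution: count_unique is a hypothetical helper name, and fixed = TRUE assumes the separator is always a literal comma.

```r
# Hypothetical helper: count distinct comma-separated tokens per string
count_unique <- function(x) {
  # fixed = TRUE treats ',' as a literal character rather than a regex
  parts <- strsplit(as.character(x), ",", fixed = TRUE)
  # drop empty tokens, then count distinct values; vapply enforces integer output
  vapply(parts, function(p) length(unique(p[nzchar(p)])), integer(1))
}

c3 <- c("1,13,4,5,4,7,9", "2,5,2,4,5,7,11,", "11,14,3,1,")
count_unique(c3)  # 6 5 4
```

The as.character call means the helper also accepts a factor column, as described in the note above.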


A base R alternative for creating df$c4, using a regular expression to extract the numbers:

df$c4 <- sapply(regmatches(df$c3, gregexpr('\\d+', df$c3)), function(x) length(unique(x)))
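
To see what the regular expression is doing, it may help to run the two steps on a single string. This is only an illustrative sketch; the variable names s, m and tokens are made up for the example.

```r
s <- "2,5,2,4,5,7,11,"
# gregexpr finds the position of every run of digits in the string
m <- gregexpr("\\d+", s)
# regmatches extracts the matched substrings (a list, one element per input string)
tokens <- regmatches(s, m)[[1]]
tokens                  # "2" "5" "2" "4" "5" "7" "11"
length(unique(tokens))  # 5
```

The tokens are compared as strings, so values such as "01" and "1" would be counted as different.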

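A tidyverse alternative:

library(dplyr)
library(tidyr)
df %>% 
  separate_rows(c3) %>%              # one row per value in c3
  filter(c3 != '') %>%               # drop the empty value left by a trailing comma
  group_by(c1, c2) %>%               # c1 and c2 together form the key
  summarise(c4 = n_distinct(c3)) %>% 
  left_join(df, ., by = c('c1', 'c2'))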
Jaap answered Oct 25 '22 15:10
