Count unique symbols per column in Pandas

I was wondering how to calculate the number of unique symbols that occur in a single column in a dataframe. For example:

import pandas as pd

df = pd.DataFrame({'col1': ['a', 'bbb', 'cc', ''], 'col2': ['ddd', 'eeeee', 'ff', 'ggggggg']})

df
  col1     col2
0    a      ddd
1  bbb    eeeee
2   cc       ff
3       ggggggg

It should calculate that col1 contains 3 unique symbols, and col2 contains 4 unique symbols.

My code so far (but this might be wrong):

unique_symbols = [0] * len(df.columns)
for i, col in enumerate(df.columns):
    observed_symbols = []
    # Cast to str so every cell can be walked character by character
    col_values = df[col].astype('str')

    # This part is where I am not so sure
    for value in col_values:
        for symbol in value:
            # Record each character the first time it is seen
            if symbol not in observed_symbols:
                observed_symbols.append(symbol)
    unique_symbols[i] = len(observed_symbols)

Thanks in advance

Alien13 asked Dec 03 '22 20:12



1 Answer

Option 1
str.join + set inside a dict comprehension
For problems like this, I'd prefer falling back to plain Python, because it's usually much faster than going through pandas.

{c : len(set(''.join(df[c]))) for c in df.columns}

{'col1': 3, 'col2': 4}
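For reference, here is the same idea as a minimal, self-contained sketch. The helper name `count_unique_symbols` is made up for illustration, and the `astype(str)` cast is an assumption borrowed from the question's code so that non-string columns don't break `str.join`.

```python
import pandas as pd


def count_unique_symbols(frame):
    """Return {column: number of distinct characters} for every column."""
    # astype(str) is an assumption: it lets numeric columns through,
    # but it also turns NaN into the literal string 'nan'.
    return {c: len(set(''.join(frame[c].astype(str)))) for c in frame.columns}


df = pd.DataFrame({'col1': ['a', 'bbb', 'cc', ''],
                   'col2': ['ddd', 'eeeee', 'ff', 'ggggggg']})

counts = count_unique_symbols(df)
print(counts)  # {'col1': 3, 'col2': 4}
```

The empty string in col1 contributes nothing to the joined string, so it doesn't affect the count. If a column can contain missing values, it's probably better to drop or fill them first than to let them be counted as the characters of 'nan'.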

Option 2
agg
If you'd rather stay in pandas space:

df.agg(lambda x: set(''.join(x)), axis=0).str.len()

Or,

df.agg(lambda x: len(set(''.join(x))), axis=0)

col1    3
col2    4
dtype: int64
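If you want to check the speed difference on your own data, a rough sketch along these lines should do. The function names `with_comprehension` and `with_agg` and the repeat factor are arbitrary choices for illustration, and the timings will vary with machine, pandas version, and data size, so no numbers are shown.

```python
import timeit

import pandas as pd

df = pd.DataFrame({'col1': ['a', 'bbb', 'cc', ''],
                   'col2': ['ddd', 'eeeee', 'ff', 'ggggggg']})

# Repeat the small frame to get something worth timing (size is arbitrary).
big = pd.concat([df] * 10_000, ignore_index=True)


def with_comprehension(frame):
    # Option 1: plain Python dict comprehension
    return {c: len(set(''.join(frame[c]))) for c in frame.columns}


def with_agg(frame):
    # Option 2: the same lambda routed through DataFrame.agg
    return frame.agg(lambda x: len(set(''.join(x))), axis=0)


# Sanity check: both approaches should report the same counts.
assert with_agg(big).to_dict() == with_comprehension(big)

for fn in (with_comprehension, with_agg):
    seconds = timeit.timeit(lambda: fn(big), number=20)
    print(f'{fn.__name__}: {seconds:.4f}s for 20 runs')
```

Both functions do the same join-and-set work per column, so any gap you see is mostly the overhead of going through agg.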
cs95 answered Dec 06 '22 09:12
