In Pandas, how to create a unique ID based on the combination of many columns?

Tags:

pandas

I have a very large dataset that looks like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({'B': ['john smith', 'john doe', 'adam smith', 'john doe', np.nan], 'C': ['indiana jones', 'duck mc duck', 'batman', 'duck mc duck', np.nan]})

df
Out[173]: 
            B              C
0  john smith  indiana jones
1    john doe   duck mc duck
2  adam smith         batman
3    john doe   duck mc duck
4         NaN            NaN

I need to create an ID variable that is unique for every B-C combination. That is, the output should be

            B              C   ID
0  john smith  indiana jones   1
1    john doe   duck mc duck   2
2  adam smith         batman   3
3    john doe   duck mc duck   2 
4         NaN            NaN   0

I don't actually care whether the IDs start at zero, or whether rows with missing values get 0 or some other number. I just want something fast that does not take a lot of memory and can be sorted quickly. Currently I use:

df['combined_id']=(df.B+df.C).rank(method='dense')

but the output is float64 and takes a lot of memory. Can we do better? Thanks!


asked Apr 15 '16 12:04

ℕʘʘḆḽḘ

2 Answers

I think you can use factorize:

df['combined_id'] = pd.factorize(df.B+df.C)[0]
print(df)
            B              C  combined_id
0  john smith  indiana jones            0
1    john doe   duck mc duck            1
2  adam smith         batman            2
3    john doe   duck mc duck            1
4         NaN            NaN           -1
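
One caveat with concatenating the columns directly: two different pairs can produce the same string (for example 'ab' + 'c' and 'a' + 'bc'). A minimal sketch of one way around this, assuming it is acceptable to factorize each column separately and combine the integer codes, is shown below. The helper name combine_ids and the two extra rows in the sample data are hypothetical, added only for illustration.

```python
import numpy as np
import pandas as pd

def combine_ids(df, columns):
    """Dense integer ID per distinct combination of `columns` (hypothetical helper)."""
    combined = np.zeros(len(df), dtype=np.int64)
    for col in columns:
        codes, uniques = pd.factorize(df[col])  # missing values get code -1
        # Shift by one so -1 becomes 0, then treat the codes as digits of a
        # mixed-radix number: distinct rows map to distinct integers.
        combined = combined * (len(uniques) + 1) + (codes + 1)
    # Factorize again so the IDs are dense, numbered in order of first appearance.
    return pd.factorize(combined)[0]

df = pd.DataFrame({
    'B': ['john smith', 'john doe', 'adam smith', 'john doe', np.nan, 'ab', 'a'],
    'C': ['indiana jones', 'duck mc duck', 'batman', 'duck mc duck', np.nan, 'c', 'bc'],
})

# Plain concatenation gives the last two rows the same code ('abc' == 'abc').
concat_codes = pd.factorize(df.B + df.C)[0]

# Combining per-column codes keeps them apart.
df['combined_id'] = combine_ids(df, ['B', 'C'])
print(df)
```

Here the row with missing values receives an ordinary non-negative ID rather than -1. With many high-cardinality columns the intermediate integer could overflow int64, so this sketch is only suitable when the product of the per-column cardinalities stays within that range. If memory matters, the result can be downcast, for example with .astype(np.int32), provided the number of distinct combinations fits.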


answered Sep 30 '22 09:09

jezrael

To make jezrael's answer a little more general (what if the columns are not strings?), you can use this compact function:

def make_identifier(df):
    str_id = df.apply(lambda x: '_'.join(map(str, x)), axis=1)
    return pd.factorize(str_id)[0]

df['combined_id'] = make_identifier(df[['B','C']])
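
Since the question mentions a very large dataset, note that a row-wise apply can be slow. A possible column-wise variant, sketched under the assumption that casting each column with astype(str) is acceptable, joins the columns with Series.str.cat instead. The name make_identifier_vectorized and the sample data are hypothetical, and how missing values are rendered by astype(str) may vary across pandas versions, so check that case on your own data.

```python
import pandas as pd

def make_identifier_vectorized(df, sep='_'):
    # Hypothetical variant of make_identifier: cast each column to str,
    # then join column by column with Series.str.cat instead of row by row.
    cols = [df[c].astype(str) for c in df.columns]
    joined = cols[0].str.cat(cols[1:], sep=sep) if len(cols) > 1 else cols[0]
    return pd.factorize(joined)[0]

# Sample data with a non-string column (made up for illustration).
df = pd.DataFrame({'B': [1, 2, 1, 3], 'C': ['x', 'y', 'x', 'x']})
df['combined_id'] = make_identifier_vectorized(df[['B', 'C']])
print(df)
```

As with the original function, two different rows can still collide if the values themselves contain the separator, so pick a separator that does not occur in the data.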

answered Sep 30 '22 11:09

Nolan Conaway
