assign hash to row of categorical data in pandas

So I have many pandas data frames with 3 columns of categorical variables:

             D              F     False
             T              F     False
             D              F     False
             T              F     False

The first and second columns can each take one of three values. The third one is binary. So there are 18 possible rows in total (not every combination is necessarily present in each data frame).

I would like to assign a number from 1 to 18 to each row, so that rows with the same combination of factors get the same number and vice versa (no hash collisions).

What is the most efficient way to do this in pandas?

In the code below, all_combination_df is a data frame with every possible combination of the factors. I am trying to turn a data frame such as big_df into a Series holding one unique number per combination.

import itertools
import pandas

def expand_grid(data_dict):
    """Create a dataframe from every combination of given values."""
    rows = itertools.product(*data_dict.values())
    return pandas.DataFrame.from_records(rows, columns=data_dict.keys())

# All 3 * 3 * 2 = 18 possible combinations of the three factors.
all_combination_df = expand_grid(
    {'variable_1': ['D', 'A', 'T'],
     'variable_2': ['C', 'A', 'B'],
     'variable_3': [True, False]})

# Example input: every combination repeated three times.
big_df = pandas.concat([all_combination_df, all_combination_df, all_combination_df])

asked Nov 05 '16 12:11

user189035

1 Answer

UPDATE: as @user189035 mentioned in the comments, it is much better to use the categorical dtype, as it saves a lot of memory.

I would use the factorize method:

In [112]: df['category'] = \
     ...:     pd.Categorical(
     ...:         pd.factorize((df.a + '~' + df.b + '~' + (df.c*1).astype(str)))[0])
     ...:

In [113]: df
Out[113]:
   a  b      c category
0  A  X   True        0
1  B  Y  False        1
2  A  X   True        0
3  C  Z  False        2
4  A  Z   True        3
5  C  Z   True        4
6  B  Y  False        1
7  C  Z  False        2

In [114]: df.dtypes
Out[114]:
a             object
b             object
c               bool
category    category
dtype: object
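
To illustrate the memory point from the update, here is a minimal sketch that compares plain integer codes with the same codes stored as a categorical. The sample frame is an assumption: it simply repeats the 18 combinations from the question 1000 times, and the column names a, b, c follow the session above.

```python
import itertools
import pandas as pd

# Hypothetical sample data: every combination of the question's factors, repeated.
rows = list(itertools.product(['D', 'A', 'T'], ['C', 'A', 'B'], [True, False]))
df = pd.DataFrame(rows * 1000, columns=['a', 'b', 'c'])

# Glue the columns into one key and number the distinct keys.
key = df.a + '~' + df.b + '~' + (df.c * 1).astype(str)
codes, uniques = pd.factorize(key)

as_int = pd.Series(codes)                  # plain integer codes
as_cat = pd.Series(pd.Categorical(codes))  # same codes, categorical dtype

print(len(uniques))                        # number of distinct combinations
print(as_int.memory_usage(index=False))    # bytes used by the integer codes
print(as_cat.memory_usage(index=False))    # bytes used by the categorical
```

With only 18 distinct values the categorical can store each row in a small integer code, which is where the saving should come from; the exact byte counts depend on the platform and pandas version.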

Explanation: this is a simple way to glue all columns into a single Series:

In [115]: df.a + '~' + df.b + '~' + (df.c*1).astype(str)
Out[115]:
0    A~X~1
1    B~Y~0
2    A~X~1
3    C~Z~0
4    A~Z~1
5    C~Z~1
6    B~Y~0
7    C~Z~0
dtype: object
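
One thing to keep in mind: factorize numbers the glued keys in order of first appearance, so the same combination can get a different number in a different data frame. If the numbers need to be the same across many frames, one possible approach (a sketch, not part of the answer above) is to number the rows of all_combination_df once and look every frame up against it with a left merge. The helper name row_ids and the column name row_id are made up for this illustration.

```python
import itertools
import pandas as pd

def expand_grid(data_dict):
    """Create a dataframe from every combination of given values."""
    rows = list(itertools.product(*data_dict.values()))
    return pd.DataFrame.from_records(rows, columns=list(data_dict))

all_combination_df = expand_grid(
    {'variable_1': ['D', 'A', 'T'],
     'variable_2': ['C', 'A', 'B'],
     'variable_3': [True, False]})

# Fixed lookup table: each combination gets a number from 1 to 18.
lookup = all_combination_df.copy()
lookup['row_id'] = range(1, len(lookup) + 1)

def row_ids(frame):
    """Return the lookup number of every row of `frame` (hypothetical helper)."""
    merged = frame.merge(lookup, how='left', on=list(all_combination_df.columns))
    # A left merge keeps the row order of `frame`, so the ids line up with it.
    return pd.Series(merged['row_id'].to_numpy(), index=frame.index, name='row_id')

big_df = pd.concat([all_combination_df] * 3)
ids = row_ids(big_df)
print(ids.nunique())
```

Because the numbering comes from the fixed lookup table rather than from the order of the data, a frame that contains only some of the combinations still gets the same number for each of them.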

answered Oct 20 '22 13:10

MaxU - stop WAR against UA

Tags: python, pandas, dataframe, hash
