assign hash to row of categorical data in pandas

So I have many pandas data frames with 3 columns of categorical variables:

             D              F     False
             T              F     False
             D              F     False
             T              F     False

The first and second columns can each take one of three values. The third one is binary. So there are a grand total of 18 possible rows (not all combinations may be represented in each data frame).

I would like to assign a number from 1 to 18 to each row, so that rows with the same combination of factors are assigned the same number and vice versa (no hash collisions).

What is the most efficient way to do this in pandas?

Below, all_combination_df is a data frame with all possible combinations of the factors. I am trying to turn a data frame such as big_df into a Series of numbers, one per row, that uniquely identify each combination.

import pandas, itertools

def expand_grid(data_dict):
    """Create a dataframe from every combination of given values."""
    rows = itertools.product(*data_dict.values())
    return pandas.DataFrame.from_records(rows, columns=data_dict.keys())

all_combination_df = expand_grid(
    {'variable_1': ['D', 'A', 'T'],
     'variable_2': ['C', 'A', 'B'],
     'variable_3': [True, False]})

big_df = pandas.concat([all_combination_df, all_combination_df, all_combination_df])
user189035 asked Nov 05 '16 12:11



1 Answer

UPDATE: as @user189035 mentioned in the comment, it's much better to use the categorical dtype, as it will save a lot of memory.

I would try the factorize method:

In [112]: df['category'] = \
     ...:     pd.Categorical(
     ...:         pd.factorize((df.a + '~' + df.b + '~' + (df.c*1).astype(str)))[0])
     ...:

In [113]: df
Out[113]:
   a  b      c category
0  A  X   True        0
1  B  Y  False        1
2  A  X   True        0
3  C  Z  False        2
4  A  Z   True        3
5  C  Z   True        4
6  B  Y  False        1
7  C  Z  False        2

In [114]: df.dtypes
Out[114]:
a             object
b             object
c               bool
category    category
dtype: object
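
For anyone who wants to run this, here is a minimal self-contained sketch. The sample frame is reconstructed from the output shown above, and the names key, codes and uniques are only illustrative. Note that pd.factorize numbers values in order of first appearance, so the codes start at 0 and depend on row order.

```python
import pandas as pd

# Sample frame reconstructed from the output shown above (an assumption,
# since the original construction of df is not given).
df = pd.DataFrame({
    'a': ['A', 'B', 'A', 'C', 'A', 'C', 'B', 'C'],
    'b': ['X', 'Y', 'X', 'Z', 'Z', 'Z', 'Y', 'Z'],
    'c': [True, False, True, False, True, True, False, False],
})

# Glue the three columns into one string key per row, e.g. 'A~X~1'.
key = df['a'] + '~' + df['b'] + '~' + df['c'].astype(int).astype(str)

# factorize returns integer codes (in order of first appearance)
# together with the distinct keys those codes refer to.
codes, uniques = pd.factorize(key)

# Store the codes with the categorical dtype, as in the answer.
df['category'] = pd.Categorical(codes)

print(df)
print(list(uniques))
```

Adding 1 to codes before storing them would give numbers starting at 1 instead of 0.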

Explanation: this is a simple way to glue all the columns into a single Series:

In [115]: df.a + '~' + df.b + '~' + (df.c*1).astype(str)
Out[115]:
0    A~X~1
1    B~Y~0
2    A~X~1
3    C~Z~0
4    A~Z~1
5    C~Z~1
6    B~Y~0
7    C~Z~0
dtype: object
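
One caveat: because factorize assigns codes in order of first appearance, the same combination can receive different numbers in different data frames. If the numbers have to agree across many frames, one possible approach (a sketch, not the method used above) is to number every possible combination once and merge that lookup table onto each frame. The column names and levels come from the question; lookup, small_df and coded are hypothetical names introduced for illustration.

```python
import itertools
import pandas as pd

# Factor levels taken from the question.
levels = {
    'variable_1': ['D', 'A', 'T'],
    'variable_2': ['C', 'A', 'B'],
    'variable_3': [True, False],
}

# One row per possible combination, numbered 1..18 in a fixed order.
lookup = pd.DataFrame(list(itertools.product(*levels.values())),
                      columns=list(levels))
lookup['code'] = range(1, len(lookup) + 1)

# A hypothetical frame that contains only some of the combinations.
small_df = pd.DataFrame({
    'variable_1': ['T', 'D', 'T'],
    'variable_2': ['B', 'C', 'B'],
    'variable_3': [False, True, False],
})

# A left merge keeps the row order of small_df and attaches the fixed code.
coded = small_df.merge(lookup, on=list(levels), how='left')
print(coded)
```

Since the lookup table is built once from the full set of levels, a given combination maps to the same number in every frame, whether or not the other combinations are present.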
MaxU - stop WAR against UA answered Oct 20 '22 13:10