So I have many pandas data frames with 3 columns of categorical variables:
D F False
T F False
D F False
T F False
The first and second columns can each take one of three values. The third one is binary. So there are a grand total of 18 possible rows (not all combinations may be represented in each data frame).
I would like to assign a number 1-18 to each row, so that rows with the same combination of factors are assigned the same number and vice versa (no hash collisions).
What is the most efficient way to do this in pandas?
Below, all_combination_df is a data frame with every possible combination of the factors. I am trying to turn a data frame such as big_df into a Series in which each combination gets its own unique number.
import pandas, itertools

def expand_grid(data_dict):
    """Create a dataframe from every combination of given values."""
    rows = itertools.product(*data_dict.values())
    return pandas.DataFrame.from_records(rows, columns=data_dict.keys())

all_combination_df = expand_grid(
    {'variable_1': ['D', 'A', 'T'],
     'variable_2': ['C', 'A', 'B'],
     'variable_3': [True, False]})

big_df = pandas.concat([all_combination_df, all_combination_df, all_combination_df])
For categorical data stored as strings you can use the pandas string functions to filter the data. Series.str.startswith() returns a boolean mask marking the values that start with a given prefix, and Series.str.endswith() does the same for values that end with a given suffix; indexing the data frame with that mask returns the matching rows.
This can be done with the groupby() method in pandas. It groups the rows by each combination of the groupby columns that actually occurs in the data. To get one result per group, follow groupby() with an aggregate function that states how each group is summarized. Some aggregate functions are mean(), sum(), count(), etc.
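As a rough sketch of how groupby() applies to the labeling task, the snippet below counts rows per combination and then uses ngroup(), which numbers each row by the group it belongs to. The frame and its column names (a, b, c) are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical sample frame with two string factors and one boolean factor.
df = pd.DataFrame({
    "a": ["A", "B", "A", "C"],
    "b": ["X", "Y", "X", "Z"],
    "c": [True, False, True, False],
})

grouped = df.groupby(["a", "b", "c"])

# Aggregate: number of rows for each combination that occurs in df.
counts = grouped.size()

# Label: 0-based group number per row, in sorted key order (sort=True is the default).
group_id = grouped.ngroup()

print(counts)
print(group_id.tolist())
```

Note that these labels depend on which combinations are present in the frame, so two different data frames can assign different numbers to the same combination.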
factorize encodes the object as an enumerated type or categorical variable. This method is useful for obtaining a numeric representation of an array when all that matters is identifying distinct values. factorize is available both as a top-level function, pandas.factorize(), and as a method, Series.factorize() and Index.factorize().
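A minimal illustration of what factorize returns, using a made-up Series: an array of integer codes plus the unique values those codes index into. By default the codes follow order of first appearance; passing sort=True numbers them by sorted value instead.

```python
import pandas as pd

# Hypothetical values, chosen only to show the encoding.
values = pd.Series(["b", "a", "b", "c", "a"])

# Codes in order of first appearance: "b" -> 0, "a" -> 1, "c" -> 2.
codes, uniques = pd.factorize(values)

# Codes in sorted order of the unique values: "a" -> 0, "b" -> 1, "c" -> 2.
codes_sorted, uniques_sorted = pd.factorize(values, sort=True)

print(codes, list(uniques))
print(codes_sorted, list(uniques_sorted))
```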
UPDATE: as @user189035 mentioned in the comments, it is much better to use the categorical dtype, as it saves a lot of memory.
I would try the factorize method:
In [112]: df['category'] = \
     ...:     pd.Categorical(
     ...:         pd.factorize((df.a + '~' + df.b + '~' + (df.c*1).astype(str)))[0])
     ...:

In [113]: df
Out[113]:
   a  b      c category
0  A  X   True        0
1  B  Y  False        1
2  A  X   True        0
3  C  Z  False        2
4  A  Z   True        3
5  C  Z   True        4
6  B  Y  False        1
7  C  Z  False        2

In [114]: df.dtypes
Out[114]:
a             object
b             object
c               bool
category    category
dtype: object
Explanation: in this simple way we can glue all the columns together into a single Series of strings, which factorize then encodes:
In [115]: df.a + '~' + df.b + '~' + (df.c*1).astype(str)
Out[115]:
0    A~X~1
1    B~Y~0
2    A~X~1
3    C~Z~0
4    A~Z~1
5    C~Z~1
6    B~Y~0
7    C~Z~0
dtype: object
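The factorize codes above are 0-based and follow the order in which combinations first appear, so they can differ from one data frame to the next. If the same combination has to map to the same number 1-18 in every data frame, one possible approach (a sketch, not part of the answer above) is to number the rows of the question's all_combination_df once and look the codes up with a left merge. The lookup frame and its code column are names introduced here for illustration.

```python
import itertools
import pandas as pd

def expand_grid(data_dict):
    """Create a dataframe from every combination of given values."""
    rows = list(itertools.product(*data_dict.values()))
    return pd.DataFrame.from_records(rows, columns=list(data_dict.keys()))

all_combination_df = expand_grid(
    {'variable_1': ['D', 'A', 'T'],
     'variable_2': ['C', 'A', 'B'],
     'variable_3': [True, False]})

# Assumption: a combination's code is its 1-based position in all_combination_df.
lookup = all_combination_df.copy()
lookup['code'] = range(1, len(lookup) + 1)

big_df = pd.concat([all_combination_df] * 3, ignore_index=True)

# A left merge keeps the row order of big_df and attaches the fixed code.
key_columns = ['variable_1', 'variable_2', 'variable_3']
codes = big_df.merge(lookup, on=key_columns, how='left')['code']

print(codes.tolist())
```

Because the lookup table is built once from the full grid, a combination that is missing from a particular data frame does not shift the numbers of the others.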