
pandas: Combining Multiple Categories into One

Let's say I have categories 1 to 10, and I want to assign red to values 3 to 5, green to 1, 6, and 7, and blue to 2, 8, 9, and 10.

How would I do this? If I try

df.cat.rename_categories(['red','green','blue'])

I get an error: ValueError: new categories need to have the same number of items than the old categories! But if I put this in

df.cat.rename_categories(['green', 'blue', 'red', 'red', 'red',
                          'green', 'green', 'blue', 'blue', 'blue'])

I'll get an error saying that there are duplicate values.

The only other method I can think of is to write a for loop that goes through a dictionary of the values and replaces them. Is there a more elegant way of resolving this?

Minh Mai asked Aug 28 '15 03:08



2 Answers

OK, this is slightly simpler, and will hopefully stimulate further conversation.

OP's example input:

>>> import pandas as pd
>>> my_data = {'numbers': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
>>> df = pd.DataFrame(data=my_data)
>>> df.numbers = df.numbers.astype('category')
>>> df.numbers.cat.rename_categories(['green', 'blue', 'red', 'red', 'red',
...                                   'green', 'green', 'blue', 'blue', 'blue'])

This yields ValueError: Categorical categories must be unique as OP states.
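To make the failure concrete, here is a minimal, self-contained sketch that reproduces it. The names `s`, `new_names`, and `rename_failed` are mine, not from the question, and the exact wording of the error message may differ between pandas versions, so the sketch only relies on the exception type.

```python
import pandas as pd

# Ten integer categories, as in the question.
s = pd.Series(range(1, 11)).astype('category')

# One new label per old category, with repeats: this is what we would like to do.
new_names = ['green', 'blue', 'red', 'red', 'red',
             'green', 'green', 'blue', 'blue', 'blue']

try:
    s.cat.rename_categories(new_names)
    rename_failed = False
except ValueError as exc:
    # Category labels must be unique, so renaming cannot merge categories.
    rename_failed = True
    print(f"rename_categories refused the duplicates: {exc}")
```

The point is that renaming keeps the number of categories fixed, so merging several categories into one needs a different operation.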

My solution:

# write out a dict with the mapping of old to new
>>> remap_cat_dict = {
...     1: 'green',
...     2: 'blue',
...     3: 'red',
...     4: 'red',
...     5: 'red',
...     6: 'green',
...     7: 'green',
...     8: 'blue',
...     9: 'blue',
...     10: 'blue'}

>>> df.numbers = df.numbers.map(remap_cat_dict).astype('category')
>>> df.numbers
0    green
1     blue
2      red
3      red
4      red
5    green
6    green
7     blue
8     blue
9     blue
Name: numbers, dtype: category
Categories (3, object): [blue, green, red]

This forces you to write out a complete dict mapping every old category to a new one, but it is very readable. The conversion is then straightforward: Series.map looks up each value of the column in remap_cat_dict and substitutes the corresponding result (a value with no key in the dict becomes NaN). Then convert the result to category and overwrite the column.
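Because the dict has to cover every old value, it can be worth checking that nothing was left unmapped before converting. The following is only a sketch under that assumption: `remap_to_category` is a hypothetical helper name of my own, not a pandas function, and it treats any value without a key in the dict as an error.

```python
import pandas as pd

remap_cat_dict = {1: 'green', 2: 'blue', 3: 'red', 4: 'red', 5: 'red',
                  6: 'green', 7: 'green', 8: 'blue', 9: 'blue', 10: 'blue'}


def remap_to_category(values, mapping):
    """Map values through `mapping`, failing loudly if any value is unmapped."""
    mapped = values.map(mapping)
    # Series.map leaves NaN wherever a value has no key in the dict.
    unmapped = values[mapped.isna() & values.notna()]
    if not unmapped.empty:
        raise KeyError(f"no mapping for: {sorted(unmapped.unique())}")
    return mapped.astype('category')


numbers = pd.Series(range(1, 11))
colors = remap_to_category(numbers, remap_cat_dict)
print(colors.cat.categories.tolist())  # ['blue', 'green', 'red']
```

Whether a missing key should raise or quietly stay NaN depends on your data; raising is just the stricter of the two choices.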

I encountered almost this exact problem, where I wanted to create a new column with fewer categories, converted from an old column. That works just as easily here (and has the benefit of not overwriting an existing column):

>>> df['colors'] = df.numbers.map(remap_cat_dict).astype('category')
>>> print(df)
  numbers colors
0       1  green
1       2   blue
2       3    red
3       4    red
4       5    red
5       6  green
6       7  green
7       8   blue
8       9   blue
9      10   blue

>>> df.colors
0    green
1     blue
2      red
3      red
4      red
5    green
6    green
7     blue
8     blue
9     blue
Name: colors, dtype: category
Categories (3, object): [blue, green, red]

EDIT 5/2/20: Simplified further by replacing df.numbers.apply(lambda x: remap_cat_dict[x]) with df.numbers.map(remap_cat_dict) (thanks @JohnE)

vector07 answered Sep 22 '22 09:09



It seems Series.explode, released with pandas 0.25.0 (July 18, 2019), would fit right in here and avoid any looping:

# Mapping dict
In [150]: m = {"red": [3,4,5], "green": [1,6,7], "blue": [2,8,9,10]}

In [151]: pd.Series(m).explode().sort_values()
Out[151]: 
green     1
blue      2
red       3
red       4
red       5
green     6
green     7
blue      8
blue      9
blue     10
dtype: object

So, the result is a pandas Series that holds all the required mappings, with the new labels as the index and the original values as the values. Depending on requirements, we can use it directly or, if a different format such as a dict or a Series keyed by the original values is needed, swap index and values. Let's explore those too.

# Mapping obtained
In [152]: s = pd.Series(m).explode().sort_values()

1) Output as dict :

In [153]: dict(zip(s.values, s.index))
Out[153]: 
{1: 'green',
 2: 'blue',
 3: 'red',
 4: 'red',
 5: 'red',
 6: 'green',
 7: 'green',
 8: 'blue',
 9: 'blue',
 10: 'blue'}

2) Output as series :

In [154]: pd.Series(s.index, s.values)
Out[154]: 
1     green
2      blue
3       red
4       red
5       red
6     green
7     green
8      blue
9      blue
10     blue
dtype: object
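To tie this back to the question, here is a hedged sketch of applying the exploded mapping to a column. The DataFrame and the names `lookup` and `colors` are my own illustration; I go through the dict form so that the keys are plain integers matching the column's values.

```python
import pandas as pd

# Grouped mapping: new label -> list of old values.
m = {"red": [3, 4, 5], "green": [1, 6, 7], "blue": [2, 8, 9, 10]}

# One row per (label, value) pair, then invert to value -> label.
s = pd.Series(m).explode()
lookup = dict(zip(s.values, s.index))

df = pd.DataFrame({"numbers": range(1, 11)})
df["colors"] = df["numbers"].map(lookup).astype("category")
print(df)
```

The grouped dict `m` is shorter to write than a full one-to-one dict, which is the main appeal when many old values share one new label.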
Divakar answered Sep 24 '22 09:09
