Let's say I have categories 1 to 10, and I want to assign red to values 3 to 5, green to 1, 6, and 7, and blue to 2, 8, 9, and 10.
How would I do this? If I try
df.cat.rename_categories(['red','green','blue'])
I get an error: ValueError: new categories need to have the same number of items than the old categories!
but if I put this in
df.cat.rename_categories(['green','blue','red', 'red', 'red',
'green', 'green', 'blue', 'blue', 'blue'])
I'll get an error saying that there are duplicate values.
The only other method I can think of is to write a for loop that goes through a dictionary of the values and replaces them. Is there a more elegant way of resolving this?
OK, this is slightly simpler and will hopefully stimulate further conversation.
OP's example input:
>>> import pandas as pd
>>> my_data = {'numbers': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
>>> df = pd.DataFrame(data=my_data)
>>> df.numbers = df.numbers.astype('category')
>>> df.numbers.cat.rename_categories(['green', 'blue', 'red', 'red', 'red',
...                                   'green', 'green', 'blue', 'blue', 'blue'])
This yields ValueError: Categorical categories must be unique, as OP states.
My solution:
>>> # write out a dict with the mapping of old to new
>>> remap_cat_dict = {
...     1: 'green',
...     2: 'blue',
...     3: 'red',
...     4: 'red',
...     5: 'red',
...     6: 'green',
...     7: 'green',
...     8: 'blue',
...     9: 'blue',
...     10: 'blue',
... }
>>> df.numbers = df.numbers.map(remap_cat_dict).astype('category')
>>> df.numbers
0 green
1 blue
2 red
3 red
4 red
5 green
6 green
7 blue
8 blue
9 blue
Name: numbers, dtype: category
Categories (3, object): [blue, green, red]
This forces you to write out a complete dict with a 1:1 mapping of old categories to new, but it is very readable. The conversion is then straightforward: Series.map looks up each value of the column in remap_cat_dict and substitutes the corresponding result. Then convert the result to category and overwrite the column.
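Because the dict has to cover every old value, it can help to check coverage before mapping: with Series.map, a value that has no key in the dict becomes NaN rather than raising an error. The following is only a minimal sketch of such a check, assuming a plain integer column; the names missing and colors are made up for illustration and are not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'numbers': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

remap_cat_dict = {
    1: 'green', 2: 'blue', 3: 'red', 4: 'red', 5: 'red',
    6: 'green', 7: 'green', 8: 'blue', 9: 'blue', 10: 'blue',
}

# Series.map turns values without a key in the dict into NaN,
# so confirm every value in the column is covered first.
missing = set(df['numbers'].unique().tolist()) - set(remap_cat_dict)
if missing:
    raise KeyError(f"no mapping for values: {sorted(missing)}")

# Map old values to new labels, then convert to a categorical.
colors = df['numbers'].map(remap_cat_dict).astype('category')
print(colors.cat.categories.tolist())
```

If a key were left out of remap_cat_dict, the check would raise before any NaN could slip into the new column.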
I encountered almost this exact problem, where I wanted to create a new column with fewer categories converted from an old column. That works just as easily here (and has the benefit of not overwriting an existing column):
>>> df['colors'] = df.numbers.map(remap_cat_dict).astype('category')
>>> print(df)
numbers colors
0 1 green
1 2 blue
2 3 red
3 4 red
4 5 red
5 6 green
6 7 green
7 8 blue
8 9 blue
9 10 blue
>>> df.colors
0 green
1 blue
2 red
3 red
4 red
5 green
6 green
7 blue
8 blue
9 blue
Name: colors, dtype: category
Categories (3, object): [blue, green, red]
EDIT 5/2/20: Further simplified by replacing df.numbers.apply(lambda x: remap_cat_dict[x]) with df.numbers.map(remap_cat_dict) (thanks @JohnE).
Series.explode, released with pandas 0.25.0 (July 18, 2019), seems to fit right in here and avoids any explicit looping:
# Mapping dict
In [150]: m = {"red": [3,4,5], "green": [1,6,7], "blue": [2,8,9,10]}
In [151]: pd.Series(m).explode().sort_values()
Out[151]:
green 1
blue 2
red 3
red 4
red 5
green 6
green 7
blue 8
blue 9
blue 10
dtype: object
So the result is a pandas Series that holds all the required mappings, with the new labels as the index and the original values as the values. Depending on requirements, we can use it directly or, if a different format such as a dict or a Series keyed by the original values is needed, swap index and values. Let's explore those too.
# Mapping obtained
In [152]: s = pd.Series(m).explode().sort_values()
1) Output as dict :
In [153]: dict(zip(s.values, s.index))
Out[153]:
{1: 'green',
2: 'blue',
3: 'red',
4: 'red',
5: 'red',
6: 'green',
7: 'green',
8: 'blue',
9: 'blue',
10: 'blue'}
2) Output as series :
In [154]: pd.Series(s.index, s.values)
Out[154]:
1 green
2 blue
3 red
4 red
5 red
6 green
7 green
8 blue
9 blue
10 blue
dtype: object
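To tie this back to the question, the dict form can be passed to Series.map on the original column. This is a sketch under the assumption that the data lives in a DataFrame column named numbers, as in the other answer; value_to_color and the colors column are illustrative names only.

```python
import pandas as pd

# Compact mapping from each new label to the old values it covers.
m = {"red": [3, 4, 5], "green": [1, 6, 7], "blue": [2, 8, 9, 10]}

# explode() yields one row per (label, value) pair: index = label, value = old value.
s = pd.Series(m).explode()

# Swap index and values to get a lookup from old value to new label.
value_to_color = dict(zip(s.values, s.index))

df = pd.DataFrame({'numbers': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
df['colors'] = df['numbers'].map(value_to_color).astype('category')
print(df)
```

The sort_values() call is not needed for the lookup itself, since a dict lookup does not depend on order.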