How to create dummy variables using pandas with reference to one value?

import pandas as pd

test = {'ngrp': ['Manhattan', 'Brooklyn', 'Queens', 'Staten Island', 'Bronx']}
test = pd.DataFrame(test)
dummy = pd.get_dummies(test['ngrp'], drop_first=True)

This gives me:

   Brooklyn  Manhattan  Queens  Staten Island
0         0          1       0              0
1         1          0       0              0
2         0          0       1              0
3         0          0       0              1
4         0          0       0              0

This makes Bronx my reference level (because that is the level that gets dropped). How do I specify that Manhattan should be the reference level instead? My expected output is:

   Brooklyn  Queens  Staten Island  Bronx
0         0       0              0      0
1         1       0              0      0
2         0       1              0      0
3         0       0              1      0
4         0       0              0      1

John peter asked Nov 15 '19 01:11


1 Answer

get_dummies sorts the values lexicographically before creating the dummy columns. That's why "Bronx" is missing from your initial result: it is the first value in sorted order, so drop_first=True removed it.
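To see the sorting for yourself, here is a small sketch that rebuilds the question's DataFrame and compares the column labels with and without drop_first. The variable names all_levels and dropped are only illustrative.

```python
import pandas as pd

test = pd.DataFrame(
    {'ngrp': ['Manhattan', 'Brooklyn', 'Queens', 'Staten Island', 'Bronx']}
)

# Without drop_first, every level gets a column, in sorted order.
all_levels = pd.get_dummies(test['ngrp'])
print(list(all_levels.columns))

# With drop_first, the first column in that sorted order ('Bronx') is removed.
dropped = pd.get_dummies(test['ngrp'], drop_first=True)
print(list(dropped.columns))
```

The order of the rows in the data plays no part here; only the sorted order of the distinct values decides which level is dropped.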

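To avoid this, enforce the ordering on a "first-seen" basis, i.e., convert the column to a categorical whose categories are listed in order of first appearance. This works here because "Manhattan" is the first value in the column, so it becomes the first category and is the one that gets dropped.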

# Categories in first-seen order put 'Manhattan' first, so drop_first removes it.
pd.get_dummies(
    pd.Categorical(test['ngrp'], categories=test['ngrp'].unique(), ordered=True),
    drop_first=True)

   Brooklyn  Queens  Staten Island  Bronx
0         0       0              0      0
1         1       0              0      0
2         0       1              0      0
3         0       0              1      0
4         0       0              0      1

One side effect is that the column labels of the result are categorical (a CategoricalIndex) rather than plain strings, but that's almost never an issue.
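If the level you want as the reference is not the first one to appear in the column, the same idea can be generalized by moving that level to the front of the category list yourself. The following is a minimal sketch, not part of pandas: dummies_with_reference is a hypothetical helper name, and it assumes the reference value actually occurs in the data. It also passes dtype=int to keep 0/1 output (recent pandas versions return booleans by default) and swaps the categorical column labels for plain ones.

```python
import pandas as pd

def dummies_with_reference(values, reference):
    """Dummy-encode `values`, dropping `reference` as the baseline level.

    Hypothetical helper for illustration; not a pandas function.
    """
    series = pd.Series(values)
    levels = list(series.dropna().unique())  # first-seen order
    if reference not in levels:
        raise ValueError(f"{reference!r} does not occur in the data")
    # Put the reference level first so that drop_first=True removes it.
    ordered_levels = [reference] + [lvl for lvl in levels if lvl != reference]
    as_categorical = series.astype(pd.CategoricalDtype(categories=ordered_levels))
    dummies = pd.get_dummies(as_categorical, drop_first=True, dtype=int)
    # Replace the categorical column labels with plain ones.
    dummies.columns = list(dummies.columns)
    return dummies

test = pd.DataFrame(
    {'ngrp': ['Manhattan', 'Brooklyn', 'Queens', 'Staten Island', 'Bronx']}
)
result = dummies_with_reference(test['ngrp'], reference='Queens')
print(result)
```

Here 'Queens' is used as the reference only to show that the row position of the level does not matter; passing reference='Manhattan' yields the same columns as the expected output in the question.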

cs95 answered Sep 29 '22 16:09