import pandas as pd

test = {'ngrp': ['Manhattan', 'Brooklyn', 'Queens', 'Staten Island', 'Bronx']}
test = pd.DataFrame(test)
dummy = pd.get_dummies(test['ngrp'], drop_first=True)
This gives me:
Brooklyn Manhattan Queens Staten Island
0 0 1 0 0
1 1 0 0 0
2 0 0 1 0
3 0 0 0 1
4 0 0 0 0
Bronx becomes my reference level, because that is the column that gets dropped. How do I specify that Manhattan should be the reference level instead? My expected output is:
Brooklyn Queens Staten Island Bronx
0 0 0 0 0
1 1 0 0 0
2 0 1 0 0
3 0 0 1 0
4 0 0 0 1
get_dummies sorts your values (lexicographically) and then creates the dummies. That's why you don't see "Bronx" in your initial result: it was the first value in sorted order, so drop_first=True dropped it.
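A quick way to see the sorting is to compare the columns with and without drop_first. This is only a small sketch on the same five values from the question; the variable names s, full and dropped are made up for illustration.

```python
import pandas as pd

s = pd.Series(['Manhattan', 'Brooklyn', 'Queens', 'Staten Island', 'Bronx'])

# Without drop_first, the columns come back in sorted order,
# not in the order the values appear in the data.
full = pd.get_dummies(s)
print(list(full.columns))

# With drop_first=True, the first of those sorted columns is removed.
dropped = pd.get_dummies(s, drop_first=True)
print(list(dropped.columns))
```

The first list starts with 'Bronx', and the second list is the same minus that entry.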
To avoid the behavior you see, enforce the ordering to be on a "first-seen" basis (i.e., convert it to an ordered categorical).
pd.get_dummies(
    pd.Categorical(test['ngrp'], categories=test['ngrp'].unique(), ordered=True),
    drop_first=True)
Brooklyn Queens Staten Island Bronx
0 0 0 0 0
1 1 0 0 0
2 0 1 0 0
3 0 0 1 0
4 0 0 0 1
Of course, this has the side effect that the result's column labels are categorical (a CategoricalIndex) rather than plain strings, but that's almost never an issue.
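The "first-seen" trick works here only because Manhattan happens to be the first row. If you want to name the reference level explicitly, one option is to move it to the front of the category list yourself. The following is a rough sketch under that assumption; dummies_with_reference is a hypothetical helper, not a pandas function, and it also converts the column labels back to a plain list.

```python
import pandas as pd

def dummies_with_reference(values, reference):
    """Dummy-encode `values`, using `reference` as the dropped baseline level.

    Hypothetical helper for illustration; not part of pandas.
    """
    values = pd.Series(values)
    levels = list(values.dropna().unique())  # levels in first-seen order
    if reference not in levels:
        raise ValueError(f"{reference!r} does not occur in the data")
    # Put the reference level first so that drop_first=True removes it.
    ordered = [reference] + [lvl for lvl in levels if lvl != reference]
    cat = pd.Categorical(values, categories=ordered)
    out = pd.get_dummies(cat, drop_first=True)
    out.columns = list(out.columns)  # plain labels instead of categorical ones
    return out

test = pd.DataFrame(
    {'ngrp': ['Manhattan', 'Brooklyn', 'Queens', 'Staten Island', 'Bronx']})

# Use 'Queens' as the baseline even though it is neither first nor sorted first.
dummy = dummies_with_reference(test['ngrp'], 'Queens')
print(dummy)
```

Calling it with 'Manhattan' instead gives the same columns as the expected output in the question.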