I have data in a CSV-like file in which a single field can hold several values. I use get_dummies() to get an overview of a column: which values occur and how often, like a histogram for nominal data. I also want to see the missing (NaN) values, but my code hides them.
I am using Series.str.get_dummies: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.get_dummies.html
I can't use pandas.get_dummies: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html, although its dummy_na parameter would solve the problem.
Reason: I need the sep parameter, which only Series.str.get_dummies offers.
To illustrate the difference:
import pandas
data = pandas.read_csv("testdata.csv", sep=";")
data["a"].str.get_dummies(",").sum()  # values are split, but NaN is not counted
pandas.get_dummies(data["a"], dummy_na=True).sum()  # NaN is counted, but values are not split
Data:
a;b
Test,Tes;
;a
Tes;a
T;b
I would expect:
T 1
Tes 2
Test 1
NaN 1
But the output of Series.str.get_dummies is:
T 1
Tes 2
Test 1
dtype: int64
or, with pandas.get_dummies:
T 1
Tes 1
Test,Tes 1
NaN 1
dtype: int64
Happy to also use another function! Maybe the .str part is the problem. I have not quite figured out what that does.
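On the .str question: .str is the pandas string accessor, which applies string methods element by element and passes missing values through. The sketch below is only an illustration, assuming a small hand-built Series that mirrors column a of the test data. It suggests that the NaN row is not dropped by Series.str.get_dummies; it simply gets a row with no indicator set, so it adds nothing to any column sum.

```python
import numpy as np
import pandas as pd

# Hand-built copy of column "a" from the question (an assumption for illustration).
s = pd.Series(["Test,Tes", np.nan, "Tes", "T"], name="a")

# .str applies the string method to each element; NaN stays NaN.
upper = s.str.upper()

# One indicator column per distinct value after splitting on ",".
dummies = s.str.get_dummies(",")

# Number of indicators set per row; the NaN row has none.
row_totals = dummies.sum(axis=1)
print(row_totals)
```

So the accessor itself is not the problem; the missing value just has no column of its own to be counted in.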
First replace the missing values with a placeholder using Series.fillna, then rename that placeholder in the resulting index back to NaN with rename:
import numpy as np
print(data["a"].fillna('Missing').str.get_dummies(",").sum().rename({'Missing': np.nan}))
NaN 1
T 1
Tes 2
Test 1
dtype: int64
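Since the question is open to other functions, here is a possible alternative that avoids the placeholder. It is a minimal sketch, not the approach from the answer above: it splits each field with Series.str.split, puts every value on its own row with Series.explode, and counts with value_counts(dropna=False). The inline csv_text stands in for testdata.csv and is an assumption made so the example runs on its own.

```python
import io

import pandas as pd

# Inline stand-in for testdata.csv (assumed to match the data in the question).
csv_text = "a;b\nTest,Tes;\n;a\nTes;a\nT;b\n"
data = pd.read_csv(io.StringIO(csv_text), sep=";")

counts = (
    data["a"]
    .str.split(",")              # "Test,Tes" -> ["Test", "Tes"]; NaN stays NaN
    .explode()                   # one row per individual value
    .value_counts(dropna=False)  # keep NaN as its own category
)
print(counts)
```

Unlike the get_dummies approach, this counts every occurrence, so a value repeated inside a single field would be counted more than once. For the test data both approaches give the same counts.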