I'm using pandas 0.18. I have loaded a dataframe from CSV using pd.read_csv(), and it looks as though the empty cells in the CSV have loaded as NaN in the dataframe.
Now I want to find the number of rows with an empty value in a particular column, but I'm struggling.
This is my dataframe:
ods id provider
0 A86016 NaN emis
1 L81042 463061 NaN
2 C84013 NaN tpp
3 G82228 462941 emis
4 C81083 NaN tpp
This is what I get from df.describe():
ods id provider
count 9897 7186 9022
unique 8066 192 4
top N83028 463090 emis
freq 7 169 4860
I want to get all the rows where provider was empty in the CSV. This is what I've tried:
>>> print len(df[df.provider == 'NaN'])
0
>>> print len(df[df.provider == np.nan])
0
I can see that there are some NaN values in there (e.g. row 1), so what gives?
Also, why does pandas convert empty values in string columns like provider to NaN? Wouldn't it make more sense to convert them to an empty string?
Use isnull to test for NaN:
df = pd.DataFrame({'ods': {0: 'A86016', 1: 'L81042', 2: 'C84013', 3: 'G82228', 4: 'C81083'},
'id': {0: np.nan, 1: 463061.0, 2: np.nan, 3: 462941.0, 4: np.nan},
'provider': {0: 'emis', 1: np.nan, 2: 'tpp', 3: 'emis', 4: 'tpp'}})
print df
id ods provider
0 NaN A86016 emis
1 463061.0 L81042 NaN
2 NaN C84013 tpp
3 462941.0 G82228 emis
4 NaN C81083 tpp
print (df[df.provider.isnull()])
ods id provider
1 L81042 463061.0 NaN
print len(df[df.provider.isnull()])
1
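Since isnull() returns a boolean mask, you can also count the missing rows directly with sum() instead of building a filtered frame and taking its length. A minimal sketch (the DataFrame is the same sample data as above; written in modern Python 3 syntax):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ods': ['A86016', 'L81042', 'C84013', 'G82228', 'C81083'],
                   'id': [np.nan, 463061.0, np.nan, 462941.0, np.nan],
                   'provider': ['emis', np.nan, 'tpp', 'emis', 'tpp']})

# isnull() yields a boolean Series; summing it counts the True entries,
# i.e. the number of rows where provider is NaN.
n_missing = df['provider'].isnull().sum()
print(n_missing)  # 1
```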
If you need to convert NaN to an empty string '', use fillna:
df.provider.fillna('', inplace=True)
print df
id ods provider
0 NaN A86016 emis
1 463061.0 L81042
2 NaN C84013 tpp
3 462941.0 G82228 emis
4 NaN C81083 tpp
Docs:
Warning
One has to be mindful that in python (and numpy), the nan's don’t compare equal, but None's do. Note that Pandas/numpy uses the fact that np.nan != np.nan, and treats None like np.nan.
In [11]: None == None
Out[11]: True
In [12]: np.nan == np.nan
Out[12]: False
So as compared to above, a scalar equality comparison versus a None/np.nan doesn’t provide useful information.
In [13]: df2['one'] == np.nan
Out[13]:
a False
b False
c False
d False
e False
f False
g False
h False
Name: one, dtype: bool
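The warning above is easy to verify yourself: an equality comparison against np.nan never matches anything, while isnull() finds the missing value. A small self-contained sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series(['emis', np.nan, 'tpp'])

# NaN != NaN by definition, so equality comparison finds nothing...
matches_by_equality = (s == np.nan).sum()
print(matches_by_equality)  # 0

# ...whereas isnull() correctly flags the missing entry.
matches_by_isnull = s.isnull().sum()
print(matches_by_isnull)  # 1
```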
But if NaN is actually the string 'nan', equality comparison works:
df = pd.DataFrame({'ods': {0: 'A86016', 1: 'L81042', 2: 'C84013', 3: 'G82228', 4: 'C81083'},
'id': {0: np.nan, 1: 463061.0, 2: np.nan, 3: 462941.0, 4: np.nan},
'provider': {0: 'emis', 1: 'nan', 2: 'tpp', 3: 'emis', 4: 'tpp'}})
print df
ods id provider
0 A86016 NaN emis
1 L81042 463061.0 nan
2 C84013 NaN tpp
3 G82228 462941.0 emis
4 C81083 NaN tpp
print (df[df.provider == 'nan'])
ods id provider
1 L81042 463061.0 nan
Do you know why pandas imports empty cells as NaN rather than empty strings?
See the read_csv docs (note the empty string '' at the end of the default list):
na_values : str, list-like or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: '-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'NA', '#NA', 'NULL', 'NaN', '-NaN', 'nan', '-nan', ''.
You can first drop the NA values, then select the rows whose index is not in the remainder:
without_na = df['provider'].dropna()
df[~df.index.isin(without_na.index)]
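As a sanity check, this index-based approach selects exactly the same rows that an isnull() mask would; a quick sketch on a small sample:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'provider': ['emis', np.nan, 'tpp', 'emis', np.nan]})

# Drop the non-missing values, then keep the rows whose index survived.
without_na = df['provider'].dropna()
na_rows = df[~df.index.isin(without_na.index)]

# Same rows as the direct boolean-mask approach.
print(na_rows.index.tolist())  # [1, 4]
```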