I have a CSV file. I read it:

```python
import pandas as pd

data = pd.read_csv('my_data.csv', sep=',')
data.head()
```
It has output like:

```
   id city department   sms  category
0   1  khi    revenue   NaN         0
1   2  lhr    revenue  good         1
2   3  lhr    revenue   NaN         0
```
I want to remove all rows where the sms column is empty/NaN. What is an efficient way to do it?
The dropna() method is your friend. When you call dropna() on the whole DataFrame without any arguments (i.e. the default behaviour), it drops every row with at least one missing value.

With dropna() you can drop rows containing NaN (Not a Number) and None values from a pandas DataFrame. Note that by default it returns a copy of the DataFrame with those rows removed. If you want to modify the existing DataFrame instead, pass inplace=True.
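A minimal sketch of the difference between the default behaviour, the subset restriction, and inplace=True (the toy frame and its values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, just for illustration.
df = pd.DataFrame({
    "sms": ["good", np.nan, "bad"],
    "category": [1, np.nan, np.nan],
})

# Default: drops every row that has a NaN in ANY column.
print(len(df.dropna()))                 # 1 row survives

# subset: only the 'sms' column is checked for NaN.
print(len(df.dropna(subset=["sms"])))   # 2 rows survive

# inplace=True mutates df itself and returns None.
df.dropna(subset=["sms"], inplace=True)
print(len(df))                          # 2
```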
Use dropna with the subset parameter to specify which column(s) to check for NaNs:

```python
data = data.dropna(subset=['sms'])
print(data)

   id city department   sms  category
1   2  lhr    revenue  good         1
```
Another solution uses boolean indexing with notnull:

```python
data = data[data['sms'].notnull()]
print(data)

   id city department   sms  category
1   2  lhr    revenue  good         1
```
An alternative with query:

```python
print(data.query("sms == sms"))

   id city department   sms  category
1   2  lhr    revenue  good         1
```
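The query trick works because NaN is the only value that does not compare equal to itself, so `sms == sms` is False exactly for the missing entries. A quick check:

```python
import numpy as np
import pandas as pd

# NaN never equals itself.
print(np.nan == np.nan)   # False

# So (s == s) is equivalent to s.notnull().
s = pd.Series(["good", np.nan])
print((s == s).tolist())  # [True, False]
```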
Timings:

```python
#[300000 rows x 5 columns]
data = pd.concat([data]*100000).reset_index(drop=True)

In [123]: %timeit (data.dropna(subset=['sms']))
100 loops, best of 3: 19.5 ms per loop

In [124]: %timeit (data[data['sms'].notnull()])
100 loops, best of 3: 13.8 ms per loop

In [125]: %timeit (data.query("sms == sms"))
10 loops, best of 3: 23.6 ms per loop
```
You can use the dropna method for this:

```python
data.dropna(axis=0, subset=['sms'])
```

See the documentation for more details on the parameters.
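For reference, `axis=0` (the row axis, also the default) drops rows, while `subset` limits the NaN check to the named column. A small sketch with invented data:

```python
import numpy as np
import pandas as pd

# Hypothetical data resembling the question's frame.
data = pd.DataFrame({
    "id": [1, 2, 3],
    "sms": [np.nan, "good", np.nan],
    "category": [0, 1, 0],
})

# axis=0 drops rows; subset restricts the check to 'sms'.
cleaned = data.dropna(axis=0, subset=["sms"])
print(cleaned["id"].tolist())   # only the row with a non-NaN 'sms' survives
```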
Of course there are multiple ways to do this, with some slight performance differences between them. Unless performance is critical, I would prefer dropna(), as it is the most expressive.
```python
import pandas as pd
import numpy as np

i = 10000000

# generate dataframe with a few columns
df = pd.DataFrame(dict(
    a_number=np.random.randint(0, 1e6, size=i),
    with_nans=np.random.choice([np.nan, 'good', 'bad', 'ok'], size=i),
    letter=np.random.choice(list('abcdefghijklmnop'), size=i))
)

# using notebook
%%timeit
a = df.dropna(subset=['with_nans'])
#1.29 s ± 112 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# using notebook
%%timeit
b = df[~df.with_nans.isnull()]
#890 ms ± 59.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# using notebook
%%timeit
c = df.query('with_nans == with_nans')
#1.71 s ± 100 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```