
Python: How to drop a row whose particular column is empty/NaN?

I have a CSV file, which I read like this:

    import pandas as pd
    data = pd.read_csv('my_data.csv', sep=',')
    data.head()

The output looks like this:

    id    city    department    sms    category
    01    khi      revenue      NaN       0
    02    lhr      revenue      good      1
    03    lhr      revenue      NaN       0

I want to remove all the rows where the sms column is empty/NaN. What is an efficient way to do it?

Haroon S. asked Sep 07 '17 08:09



People also ask

How do you delete a row with Na in a specific column in Python?

The dropna() method handles this. Called on the whole DataFrame without any arguments (the default behaviour), it drops every row that has at least one missing value in any column. To check only a specific column, pass it through the subset parameter.
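The difference between the default call and a column-restricted call can be sketched as follows. The frame and its values are hypothetical, made up only to put missing values in different columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: missing values sit in different columns on different rows.
df = pd.DataFrame({
    "city": ["khi", "lhr", None, "isb"],
    "sms": [np.nan, "good", "bad", "ok"],
})

# Default: a row is dropped if ANY of its columns is missing.
any_missing_dropped = df.dropna()

# subset: only the listed column(s) are checked for missing values.
sms_missing_dropped = df.dropna(subset=["sms"])

print(any_missing_dropped.index.tolist())
print(sms_missing_dropped.index.tolist())
```

Row 2 has a missing city but a valid sms, so it survives only the subset call.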

How do you drop a row with NaN in Python?

The dropna() method drops rows containing NaN (Not a Number) or None values from a pandas DataFrame. Note that by default it returns a copy of the DataFrame with those rows removed. To remove them from the existing DataFrame instead, pass inplace=True.
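A minimal sketch of the copy versus in-place behaviour, using a hypothetical one-column frame:

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing sms values.
df = pd.DataFrame({"sms": [np.nan, "good", np.nan]})

# Default call: returns a new DataFrame and leaves df untouched.
cleaned = df.dropna(subset=["sms"])
rows_before = len(df)

# inplace=True: modifies df itself and returns None.
result = df.dropna(subset=["sms"], inplace=True)
rows_after = len(df)

print(rows_before, rows_after, result)
```

Because the in-place call returns None, writing `df = df.dropna(..., inplace=True)` would replace the frame with None; either reassign the default result or use inplace=True, not both.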


2 Answers

Use dropna with the subset parameter to specify which column to check for NaNs:

    data = data.dropna(subset=['sms'])
    print(data)

       id city department   sms  category
    1   2  lhr    revenue  good         1
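For anyone who wants to try this without the original file, here is a self-contained sketch. The inline CSV text is an assumed stand-in for my_data.csv, with empty sms fields in place of the NaN cells. It relies on read_csv parsing empty fields as NaN by default, which is why the "empty" and "NaN" cases are handled by the same call.

```python
import io

import pandas as pd

# Assumed stand-in for my_data.csv; the empty sms fields mimic "empty" cells.
csv_text = """id,city,department,sms,category
01,khi,revenue,,0
02,lhr,revenue,good,1
03,lhr,revenue,,0
"""

data = pd.read_csv(io.StringIO(csv_text), sep=",")

# Empty fields were read as NaN, so dropna on the sms column removes them.
data = data.dropna(subset=["sms"])
print(data)
```

If the column held whitespace-only strings rather than truly empty fields, they would not be read as NaN and would need to be converted first.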

Another solution with boolean indexing and notnull:

    data = data[data['sms'].notnull()]
    print(data)

       id city department   sms  category
    1   2  lhr    revenue  good         1

Alternative with query:

    print(data.query("sms == sms"))

       id city department   sms  category
    1   2  lhr    revenue  good         1
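The query version works because NaN never compares equal to itself, so `sms == sms` is False exactly on the missing rows. A small sketch on hypothetical data shows that the comparison yields the same mask as notnull():

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing values.
data = pd.DataFrame({"sms": [np.nan, "good", np.nan]})

# NaN is not equal to itself, so this is False on the NaN rows only.
self_equal = data["sms"] == data["sms"]

# For this column the comparison produces the same mask as notnull().
same_mask = self_equal.equals(data["sms"].notnull())

# query() evaluates the same comparison and keeps the rows where it is True.
filtered = data.query("sms == sms")

print(self_equal.tolist(), same_mask)
```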

Timings

    #[300000 rows x 5 columns]
    data = pd.concat([data]*100000).reset_index(drop=True)

    In [123]: %timeit (data.dropna(subset=['sms']))
    100 loops, best of 3: 19.5 ms per loop

    In [124]: %timeit (data[data['sms'].notnull()])
    100 loops, best of 3: 13.8 ms per loop

    In [125]: %timeit (data.query("sms == sms"))
    10 loops, best of 3: 23.6 ms per loop
jezrael answered Sep 22 '22 15:09



You can use the method dropna for this:

    # dropna returns a new DataFrame, so assign the result
    data = data.dropna(axis=0, subset=('sms', ))

See the pandas documentation for DataFrame.dropna for more details on the parameters.

Of course there are multiple ways to do this, and there are some slight performance differences. Unless performance is critical, I would prefer the use of dropna() as it is the most expressive.

    import pandas as pd
    import numpy as np

    i = 10000000

    # generate dataframe with a few columns
    # dtype=object keeps np.nan as a real missing value; a plain list that
    # mixes np.nan with strings is coerced by NumPy to the string 'nan'
    df = pd.DataFrame(dict(
        a_number=np.random.randint(0, 10**6, size=i),
        with_nans=np.random.choice(
            np.array([np.nan, 'good', 'bad', 'ok'], dtype=object), size=i),
        letter=np.random.choice(list('abcdefghijklmnop'), size=i),
    ))

    # using notebook
    %%timeit
    a = df.dropna(subset=['with_nans'])
    #1.29 s ± 112 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    # using notebook
    %%timeit
    b = df[~df.with_nans.isnull()]
    #890 ms ± 59.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    # using notebook
    %%timeit
    c = df.query('with_nans == with_nans')
    #1.71 s ± 100 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
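Outside a notebook, a similar comparison could be run with the standard timeit module. This is only a rough sketch: the row count, the seed and the `candidates` dictionary are assumptions made for illustration, and the timings will vary with the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

n = 100_000  # assumed size, much smaller than the 10,000,000 rows used above

rng = np.random.default_rng(0)
# Object dtype so that np.nan stays a real missing value next to the strings.
values = np.array([np.nan, "good", "bad", "ok"], dtype=object)
df = pd.DataFrame({"with_nans": values[rng.integers(0, len(values), size=n)]})

# The three approaches discussed in the answers.
candidates = {
    "dropna": lambda: df.dropna(subset=["with_nans"]),
    "notnull": lambda: df[df["with_nans"].notnull()],
    "query": lambda: df.query("with_nans == with_nans"),
}

# Keep one result per approach so they can be checked against each other.
results = {name: fn() for name, fn in candidates.items()}

for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=5) / 5
    print(f"{name}: {seconds * 1e3:.2f} ms per call")
```

Checking that all three results are identical before comparing timings guards against benchmarking approaches that do not actually filter the same rows.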
Thijs D answered Sep 23 '22 15:09
