I have a CSV file. I read it:

```python
import pandas as pd

data = pd.read_csv('my_data.csv', sep=',')
data.head()
```
It has output like:

```
   id city department   sms  category
0   1  khi    revenue   NaN         0
1   2  lhr    revenue  good         1
2   3  lhr    revenue   NaN         0
```
I want to remove all rows where the sms column is empty/NaN. What is an efficient way to do it?
The dropna() method is your friend. When you call dropna() on the whole DataFrame without any arguments (i.e. the default behaviour), it drops every row with at least one missing value.

With dropna() you can drop rows containing NaN (Not a Number) and None values from a pandas DataFrame. Note that by default it returns a copy of the DataFrame with those rows removed. If you want to modify the existing DataFrame instead, pass inplace=True.
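A minimal sketch of the difference between the default behaviour, the subset restriction, and inplace=True (the toy frame and its values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, just for illustration.
df = pd.DataFrame({
    "sms": ["good", np.nan, "bad"],
    "category": [1, np.nan, np.nan],
})

# Default: drops every row that has a NaN in ANY column.
print(len(df.dropna()))                 # 1 row survives

# subset: only the 'sms' column is checked for NaN.
print(len(df.dropna(subset=["sms"])))   # 2 rows survive

# inplace=True mutates df itself and returns None.
df.dropna(subset=["sms"], inplace=True)
print(len(df))                          # 2
```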
Use dropna with the subset parameter to specify which column(s) to check for NaNs:

```python
data = data.dropna(subset=['sms'])
print(data)

   id city department   sms  category
1   2  lhr    revenue  good         1
```
Another solution uses boolean indexing with notnull:

```python
data = data[data['sms'].notnull()]
print(data)

   id city department   sms  category
1   2  lhr    revenue  good         1
```
An alternative with query:

```python
print(data.query("sms == sms"))

   id city department   sms  category
1   2  lhr    revenue  good         1
```
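The query trick works because NaN is the only value that does not compare equal to itself, so `sms == sms` is False exactly for the missing entries. A quick check:

```python
import numpy as np
import pandas as pd

# NaN never equals itself.
print(np.nan == np.nan)   # False

# So (s == s) is equivalent to s.notnull().
s = pd.Series(["good", np.nan])
print((s == s).tolist())  # [True, False]
```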
Timings:

```python
#[300000 rows x 5 columns]
data = pd.concat([data]*100000).reset_index(drop=True)

In [123]: %timeit (data.dropna(subset=['sms']))
100 loops, best of 3: 19.5 ms per loop

In [124]: %timeit (data[data['sms'].notnull()])
100 loops, best of 3: 13.8 ms per loop

In [125]: %timeit (data.query("sms == sms"))
10 loops, best of 3: 23.6 ms per loop
```
You can use the dropna method for this:

```python
data.dropna(axis=0, subset=['sms'])
```

See the documentation for more details on the parameters.
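For reference, `axis=0` (the row axis, also the default) drops rows, while `subset` limits the NaN check to the named column. A small sketch with invented data:

```python
import numpy as np
import pandas as pd

# Hypothetical data resembling the question's frame.
data = pd.DataFrame({
    "id": [1, 2, 3],
    "sms": [np.nan, "good", np.nan],
    "category": [0, 1, 0],
})

# axis=0 drops rows; subset restricts the check to 'sms'.
cleaned = data.dropna(axis=0, subset=["sms"])
print(cleaned["id"].tolist())   # only the row with a non-NaN 'sms' survives
```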
Of course there are multiple ways to do this, with some slight performance differences between them. Unless performance is critical, I would prefer dropna(), as it is the most expressive.
```python
import pandas as pd
import numpy as np

i = 10000000

# generate dataframe with a few columns
df = pd.DataFrame(dict(
    a_number=np.random.randint(0, 1e6, size=i),
    with_nans=np.random.choice([np.nan, 'good', 'bad', 'ok'], size=i),
    letter=np.random.choice(list('abcdefghijklmnop'), size=i))
)

# using notebook
%%timeit
a = df.dropna(subset=['with_nans'])
#1.29 s ± 112 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# using notebook
%%timeit
b = df[~df.with_nans.isnull()]
#890 ms ± 59.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# using notebook
%%timeit
c = df.query('with_nans == with_nans')
#1.71 s ± 100 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```