I have this DataFrame
and want only the records whose EPS
column is not NaN
:
>>> df STK_ID EPS cash STK_ID RPT_Date 601166 20111231 601166 NaN NaN 600036 20111231 600036 NaN 12 600016 20111231 600016 4.3 NaN 601009 20111231 601009 NaN NaN 601939 20111231 601939 2.5 NaN 000001 20111231 000001 NaN NaN
...i.e. something like df.drop(....)
to get this resulting dataframe:
STK_ID EPS cash STK_ID RPT_Date 600016 20111231 600016 4.3 NaN 601939 20111231 601939 2.5 NaN
How do I do that?
To drop all the rows with the NaN values, you may use df. dropna().
Use pandas. DataFrame. drop() method to delete/remove rows with condition(s).
Pandas DataFrame dropna() function is used to remove rows and columns with Null/NaN values. By default, this function returns a new DataFrame and the source DataFrame remains unchanged. We can create null values using None, pandas. NaT, and numpy.
One of the fastest ways to delete rows that contain a specific value or fulfill a given condition is to filter these. Once you have the filtered data, you can delete all these rows (while the remaining rows remain intact).
Don't drop, just take the rows where EPS is not NA:
df = df[df['EPS'].notna()]
This question is already resolved, but...
...also consider the solution suggested by Wouter in his original comment. The ability to handle missing data, including dropna()
, is built into pandas explicitly. Aside from potentially improved performance over doing it manually, these functions also come with a variety of options which may be useful.
In [24]: df = pd.DataFrame(np.random.randn(10,3)) In [25]: df.iloc[::2,0] = np.nan; df.iloc[::4,1] = np.nan; df.iloc[::3,2] = np.nan; In [26]: df Out[26]: 0 1 2 0 NaN NaN NaN 1 2.677677 -1.466923 -0.750366 2 NaN 0.798002 -0.906038 3 0.672201 0.964789 NaN 4 NaN NaN 0.050742 5 -1.250970 0.030561 -2.678622 6 NaN 1.036043 NaN 7 0.049896 -0.308003 0.823295 8 NaN NaN 0.637482 9 -0.310130 0.078891 NaN
In [27]: df.dropna() #drop all rows that have any NaN values Out[27]: 0 1 2 1 2.677677 -1.466923 -0.750366 5 -1.250970 0.030561 -2.678622 7 0.049896 -0.308003 0.823295
In [28]: df.dropna(how='all') #drop only if ALL columns are NaN Out[28]: 0 1 2 1 2.677677 -1.466923 -0.750366 2 NaN 0.798002 -0.906038 3 0.672201 0.964789 NaN 4 NaN NaN 0.050742 5 -1.250970 0.030561 -2.678622 6 NaN 1.036043 NaN 7 0.049896 -0.308003 0.823295 8 NaN NaN 0.637482 9 -0.310130 0.078891 NaN
In [29]: df.dropna(thresh=2) #Drop row if it does not have at least two values that are **not** NaN Out[29]: 0 1 2 1 2.677677 -1.466923 -0.750366 2 NaN 0.798002 -0.906038 3 0.672201 0.964789 NaN 5 -1.250970 0.030561 -2.678622 7 0.049896 -0.308003 0.823295 9 -0.310130 0.078891 NaN
In [30]: df.dropna(subset=[1]) #Drop only if NaN in specific column (as asked in the question) Out[30]: 0 1 2 1 2.677677 -1.466923 -0.750366 2 NaN 0.798002 -0.906038 3 0.672201 0.964789 NaN 5 -1.250970 0.030561 -2.678622 6 NaN 1.036043 NaN 7 0.049896 -0.308003 0.823295 9 -0.310130 0.078891 NaN
There are also other options (See docs at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html), including dropping columns instead of rows.
Pretty handy!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With