I am trying to clean my dataframe so that if either the "Base_2007" or "Base_2011" column contains NA for any row of a county, that county is dropped completely. In my sample below, since both counties contain NA, both will be dropped and an empty dataset will be returned. Is it possible to do something like this?
Data:
State Year Base_2007 Base_2011 County
0 AL 2012 NaN 14.0 Alabama_Country
1 AL 2013 12.0 20.0 Alabama_Country
2 AL 2014 13.0 NaN Alabama_Country
3 DC 2011 NaN 20.0 Trenton
4 DC 2012 19.0 NaN Trenton
5 DC 2013 20.0 21.0 Trenton
6 DC 2014 25.0 30.0 Trenton
Tail section of dataframe:
{'State': {82550: 'WY', 82551: 'WY', 82552: 'WY', 82553: 'WY', 82554: 'WY', 82555: 'WY', 82556: 'WY', 82557: 'WY', 82558: 'WY', 82559: 'WY'}, 'County': {82550: 'Weston', 82551: 'Weston', 82552: 'Weston', 82553: 'Weston', 82554: 'Weston', 82555: 'Weston', 82556: 'Weston', 82557: 'Weston', 82558: 'Weston', 82559: 'Weston'}, 'FIPS code': {82550: 56045, 82551: 56045, 82552: 56045, 82553: 56045, 82554: 56045, 82555: 56045, 82556: 56045, 82557: 56045, 82558: 56045, 82559: 56045}, 'Year': {82550: 2008, 82551: 2009, 82552: 2010, 82553: 2011, 82554: 2012, 82555: 2013, 82556: 2014, 82557: 2015, 82558: 2016, 82559: 2017}, 'Annual_pct_change': {82550: 6.52, 82551: -2.93, 82552: -5.61, 82553: 1.9, 82554: 5.16, 82555: -4.03, 82556: 7.69, 82557: -2.35, 82558: 1.67, 82559: 5.56}, 'HPI': {82550: 195.73, 82551: 189.99, 82552: 179.33, 82553: 182.73, 82554: 192.15, 82555: 184.4, 82556: 198.58, 82557: 193.9, 82558: 197.14, 82559: 208.11}, 'HPI1990': {82550: nan, 82551: nan, 82552: nan, 82553: nan, 82554: nan, 82555: nan, 82556: nan, 82557: nan, 82558: nan, 82559: nan}, 'HPI2000': {82550: 190.09, 82551: 184.51, 82552: 174.16, 82553: 177.46, 82554: 186.61, 82555: 179.08, 82556: 192.86, 82557: 188.31, 82558: 191.46, 82559: 202.11}, 'CountyName': {82550: 'Weston County', 82551: 'Weston County', 82552: 'Weston County', 82553: 'Weston County', 82554: 'Weston County', 82555: 'Weston County', 82556: 'Weston County', 82557: 'Weston County', 82558: 'Weston County', 82559: 'Weston County'}}
Head section of dataframe:
{'State': {0: 'AL', 1: 'AL', 2: 'AL', 3: 'AL', 4: 'AL', 5: 'AL', 6: 'AL', 7: 'AL', 8: 'AL', 9: 'AL'}, 'County': {0: 'Autauga', 1: 'Autauga', 2: 'Autauga', 3: 'Autauga', 4: 'Autauga', 5: 'Autauga', 6: 'Autauga', 7: 'Autauga', 8: 'Autauga', 9: 'Autauga'}, 'FIPS code': {0: 1001, 1: 1001, 2: 1001, 3: 1001, 4: 1001, 5: 1001, 6: 1001, 7: 1001, 8: 1001, 9: 1001}, 'Year': {0: 1986, 1: 1987, 2: 1988, 3: 1989, 4: 1990, 5: 1991, 6: 1992, 7: 1993, 8: 1994, 9: 1995}, 'Annual_pct_change': {0: nan, 1: -2.17, 2: 3.24, 3: 4.16, 4: -0.35, 5: 2.69, 6: 2.85, 7: 3.34, 8: 4.33, 9: 3.48}, 'HPI': {0: 100.0, 1: 97.83, 2: 100.99, 3: 105.19, 4: 104.82, 5: 107.64, 6: 110.7, 7: 114.4, 8: 119.35, 9: 123.5}, 'HPI1990': {0: 95.4, 1: 93.33, 2: 96.35, 3: 100.36, 4: 100.0, 5: 102.69, 6: 105.61, 7: 109.14, 8: 113.86, 9: 117.82}, 'HPI2000': {0: 71.03, 1: 69.49, 2: 71.74, 3: 74.72, 4: 74.45, 5: 76.46, 6: 78.63, 7: 81.26, 8: 84.77, 9: 87.72}, 'CountyName': {0: 'Autauga County', 1: 'Autauga County', 2: 'Autauga County', 3: 'Autauga County', 4: 'Autauga County', 5: 'Autauga County', 6: 'Autauga County', 7: 'Autauga County', 8: 'Autauga County', 9: 'Autauga County'}}
Note: in the dict dumps above, Base_2007 = HPI1990 and Base_2011 = HPI2000.
If we instead needed to drop columns that contain NA, we could pass axis='columns' to DataFrame.dropna(). With the default how='any', this removes every column in which one or more values are missing.
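For instance, a minimal sketch of the column-wise behaviour (the toy frame below is invented purely for illustration):
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1.0, np.nan], 'b': [3.0, 4.0]})
# axis='columns' drops every column that contains at least one NaN
print(toy.dropna(axis='columns'))  # only column 'b' survives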
The dropna() method drops rows containing NaN (Not a Number) and None values from a pandas DataFrame. Note that by default it returns a copy of the DataFrame with those rows removed; to modify the existing DataFrame instead, pass inplace=True.
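And the row-wise default, again on an invented toy frame:
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0]})
clean = toy.dropna()      # returns a copy with row 1 removed; toy unchanged
toy.dropna(inplace=True)  # mutates toy directly instead of returning a copy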
I tested this on the dataset below (this also requires the NA values to be replaced by np.nan, as in df = df.replace('NA', np.nan), if they are strings):
print(df)
State Year Base_2007 Base_2011 County
0 AL 2012 NaN 14.0 Alabama_Country
1 AL 2013 12.0 20.0 Alabama_Country
2 AL 2014 13.0 NaN Alabama_Country
3 DC 2011 NaN 20.0 Trenton
4 DC 2012 19.0 NaN Trenton
5 DC 2013 20.0 21.0 Trenton
6 DC 2014 25.0 30.0 Trenton
7 DM 2013 34.0 45.0 Dummy
8 DM 2012 34.0 45.0 Dummy
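For reference, this test frame can be rebuilt as follows (a sketch with the values read off the printout above):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'State': ['AL', 'AL', 'AL', 'DC', 'DC', 'DC', 'DC', 'DM', 'DM'],
    'Year': [2012, 2013, 2014, 2011, 2012, 2013, 2014, 2013, 2012],
    'Base_2007': [np.nan, 12.0, 13.0, np.nan, 19.0, 20.0, 25.0, 34.0, 34.0],
    'Base_2011': [14.0, 20.0, np.nan, 20.0, np.nan, 21.0, 30.0, 45.0, 45.0],
    'County': ['Alabama_Country'] * 3 + ['Trenton'] * 4 + ['Dummy'] * 2,
})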
Removing every County containing NaN works by using:
df_new=df.loc[~df.County.isin(df.loc[df[['Base_2007','Base_2011']].isna().\
any(axis=1),'County'])]
print(df_new)
State Year Base_2007 Base_2011 County
7 DM 2013 34.0 45.0 Dummy
8 DM 2012 34.0 45.0 Dummy
Explanation
The following finds any NaN rows based on the subset of Base_2007 and Base_2011:
df[['Base_2007','Base_2011']].isna().any(axis=1)
0 True
1 False
2 True
3 True
4 True
5 False
6 False
7 False
8 False
Taking the above output as a boolean mask, we call df.loc[] as:
df.loc[df[['Base_2007','Base_2011']].isna().any(axis=1),'County']
which gives:
0 Alabama_Country
2 Alabama_Country
3 Trenton
4 Trenton
Note that we select only the County column inside df.loc[]; the reason for this becomes clear in the next step.
We take the output above and, using Series.isin(), check whether each cell of the County column in the original dataframe exists in it. This returns True for the rows of County that are present in the output of df.loc[]. Then we negate the result with the invert operator ~, which turns every True to False and vice versa.
~df.County.isin(df.loc[df[['Base_2007','Base_2011']].isna().any(axis=1),'County'])
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 True
8 True
Once we have this mask ready, we apply the same df.loc[] logic again. Finally, we get a dataframe containing only those County values that do not have a NaN in Base_2007 or Base_2011.
Note: if we want the index to start from 0 rather than keeping the index of the original slice, we can append reset_index(drop=True) to the code:
df_new=df.loc[~df.County.isin(df.loc[df[['Base_2007','Base_2011']].isna().\
any(axis=1),'County'])].reset_index(drop=True)
State Year Base_2007 Base_2011 County
0 DM 2013 34.0 45.0 Dummy
1 DM 2012 34.0 45.0 Dummy
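As a side note, the same result can also be reached with a groupby/filter, which some may find easier to read (an alternative sketch, not the approach explained above):
df_new = df.groupby('County').filter(
    lambda g: g[['Base_2007', 'Base_2011']].notna().all().all()
).reset_index(drop=True)
# keeps only those counties with no NaN anywhere in Base_2007 or Base_2011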