I am trying to clean my dataframe so that if either the "Base_2007" or "Base_2011" column contains NA for any row of a county, that county is dropped completely. In my sample below, since both counties contain NA, both will be dropped and an empty dataset will be returned. Is it possible to do something like this?
Data:
State Year Base_2007 Base_2011 County
0 AL 2012 NaN 14.0 Alabama_Country
1 AL 2013 12.0 20.0 Alabama_Country
2 AL 2014 13.0 NaN Alabama_Country
3 DC 2011 NaN 20.0 Trenton
4 DC 2012 19.0 NaN Trenton
5 DC 2013 20.0 21.0 Trenton
6 DC 2014 25.0 30.0 Trenton
Tail section of dataframe:
{'State': {82550: 'WY', 82551: 'WY', 82552: 'WY', 82553: 'WY', 82554: 'WY', 82555: 'WY', 82556: 'WY', 82557: 'WY', 82558: 'WY', 82559: 'WY'}, 'County': {82550: 'Weston', 82551: 'Weston', 82552: 'Weston', 82553: 'Weston', 82554: 'Weston', 82555: 'Weston', 82556: 'Weston', 82557: 'Weston', 82558: 'Weston', 82559: 'Weston'}, 'FIPS code': {82550: 56045, 82551: 56045, 82552: 56045, 82553: 56045, 82554: 56045, 82555: 56045, 82556: 56045, 82557: 56045, 82558: 56045, 82559: 56045}, 'Year': {82550: 2008, 82551: 2009, 82552: 2010, 82553: 2011, 82554: 2012, 82555: 2013, 82556: 2014, 82557: 2015, 82558: 2016, 82559: 2017}, 'Annual_pct_change': {82550: 6.52, 82551: -2.93, 82552: -5.61, 82553: 1.9, 82554: 5.16, 82555: -4.03, 82556: 7.69, 82557: -2.35, 82558: 1.67, 82559: 5.56}, 'HPI': {82550: 195.73, 82551: 189.99, 82552: 179.33, 82553: 182.73, 82554: 192.15, 82555: 184.4, 82556: 198.58, 82557: 193.9, 82558: 197.14, 82559: 208.11}, 'HPI1990': {82550: nan, 82551: nan, 82552: nan, 82553: nan, 82554: nan, 82555: nan, 82556: nan, 82557: nan, 82558: nan, 82559: nan}, 'HPI2000': {82550: 190.09, 82551: 184.51, 82552: 174.16, 82553: 177.46, 82554: 186.61, 82555: 179.08, 82556: 192.86, 82557: 188.31, 82558: 191.46, 82559: 202.11}, 'CountyName': {82550: 'Weston County', 82551: 'Weston County', 82552: 'Weston County', 82553: 'Weston County', 82554: 'Weston County', 82555: 'Weston County', 82556: 'Weston County', 82557: 'Weston County', 82558: 'Weston County', 82559: 'Weston County'}}
Head section of dataframe:
{'State': {0: 'AL', 1: 'AL', 2: 'AL', 3: 'AL', 4: 'AL', 5: 'AL', 6: 'AL', 7: 'AL', 8: 'AL', 9: 'AL'}, 'County': {0: 'Autauga', 1: 'Autauga', 2: 'Autauga', 3: 'Autauga', 4: 'Autauga', 5: 'Autauga', 6: 'Autauga', 7: 'Autauga', 8: 'Autauga', 9: 'Autauga'}, 'FIPS code': {0: 1001, 1: 1001, 2: 1001, 3: 1001, 4: 1001, 5: 1001, 6: 1001, 7: 1001, 8: 1001, 9: 1001}, 'Year': {0: 1986, 1: 1987, 2: 1988, 3: 1989, 4: 1990, 5: 1991, 6: 1992, 7: 1993, 8: 1994, 9: 1995}, 'Annual_pct_change': {0: nan, 1: -2.17, 2: 3.24, 3: 4.16, 4: -0.35, 5: 2.69, 6: 2.85, 7: 3.34, 8: 4.33, 9: 3.48}, 'HPI': {0: 100.0, 1: 97.83, 2: 100.99, 3: 105.19, 4: 104.82, 5: 107.64, 6: 110.7, 7: 114.4, 8: 119.35, 9: 123.5}, 'HPI1990': {0: 95.4, 1: 93.33, 2: 96.35, 3: 100.36, 4: 100.0, 5: 102.69, 6: 105.61, 7: 109.14, 8: 113.86, 9: 117.82}, 'HPI2000': {0: 71.03, 1: 69.49, 2: 71.74, 3: 74.72, 4: 74.45, 5: 76.46, 6: 78.63, 7: 81.26, 8: 84.77, 9: 87.72}, 'CountyName': {0: 'Autauga County', 1: 'Autauga County', 2: 'Autauga County', 3: 'Autauga County', 4: 'Autauga County', 5: 'Autauga County', 6: 'Autauga County', 7: 'Autauga County', 8: 'Autauga County', 9: 'Autauga County'}}
Note: in the dict dumps above, Base_2007 = HPI1990 and Base_2011 = HPI2000.
If we instead needed to drop columns that contain NA, we could pass axis='columns' to DataFrame.dropna(). With the default how='any', this removes every column in which one or more values are missing.
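For instance, a minimal sketch of the column-wise behaviour (the toy frame below is invented purely for illustration):
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1.0, np.nan], 'b': [3.0, 4.0]})
# axis='columns' drops every column that contains at least one NaN
print(toy.dropna(axis='columns'))  # only column 'b' survives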
The dropna() method drops rows containing NaN (Not a Number) and None values from a pandas DataFrame. Note that by default it returns a copy of the DataFrame with those rows removed; to modify the existing DataFrame instead, pass inplace=True.
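And the row-wise default, again on an invented toy frame:
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0]})
clean = toy.dropna()      # returns a copy with row 1 removed; toy unchanged
toy.dropna(inplace=True)  # mutates toy directly instead of returning a copy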
I tested this on the dataset below (this also requires the NA values to be replaced by np.nan, as in df = df.replace('NA', np.nan), if they are strings):
print(df)
State Year Base_2007 Base_2011 County
0 AL 2012 NaN 14.0 Alabama_Country
1 AL 2013 12.0 20.0 Alabama_Country
2 AL 2014 13.0 NaN Alabama_Country
3 DC 2011 NaN 20.0 Trenton
4 DC 2012 19.0 NaN Trenton
5 DC 2013 20.0 21.0 Trenton
6 DC 2014 25.0 30.0 Trenton
7 DM 2013 34.0 45.0 Dummy
8 DM 2012 34.0 45.0 Dummy
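For reference, this test frame can be rebuilt as follows (a sketch with the values read off the printout above):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'State': ['AL', 'AL', 'AL', 'DC', 'DC', 'DC', 'DC', 'DM', 'DM'],
    'Year': [2012, 2013, 2014, 2011, 2012, 2013, 2014, 2013, 2012],
    'Base_2007': [np.nan, 12.0, 13.0, np.nan, 19.0, 20.0, 25.0, 34.0, 34.0],
    'Base_2011': [14.0, 20.0, np.nan, 20.0, np.nan, 21.0, 30.0, 45.0, 45.0],
    'County': ['Alabama_Country'] * 3 + ['Trenton'] * 4 + ['Dummy'] * 2,
})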
Removing every County containing NaN works by using:
df_new=df.loc[~df.County.isin(df.loc[df[['Base_2007','Base_2011']].isna().\
any(axis=1),'County'])]
print(df_new)
State Year Base_2007 Base_2011 County
7 DM 2013 34.0 45.0 Dummy
8 DM 2012 34.0 45.0 Dummy
Explanation
The following finds any NaN rows based on the subset of Base_2007 and Base_2011:
df[['Base_2007','Base_2011']].isna().any(axis=1)
0 True
1 False
2 True
3 True
4 True
5 False
6 False
7 False
8 False
Taking the above output as a boolean mask, we call df.loc[] as:
df.loc[df[['Base_2007','Base_2011']].isna().any(axis=1),'County']
which gives:
0 Alabama_Country
2 Alabama_Country
3 Trenton
4 Trenton
Note that we select only the County column inside df.loc[]; the reason for this becomes clear in the next step.
We take the output above and, using Series.isin(), check whether each cell of the County column in the original dataframe exists in it. This returns True for the rows of County that are present in the output of df.loc[]. Then we negate the result with the invert operator ~, which turns every True to False and vice versa.
~df.County.isin(df.loc[df[['Base_2007','Base_2011']].isna().any(axis=1),'County'])
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 True
8 True
Once we have this mask ready, we apply the same df.loc[] logic again. Finally, we get a dataframe containing only those County values that do not have a NaN in Base_2007 or Base_2011.
Note: if we want the index to start from 0 rather than keeping the index of the original slice, we can append reset_index(drop=True) to the code:
df_new=df.loc[~df.County.isin(df.loc[df[['Base_2007','Base_2011']].isna().\
any(axis=1),'County'])].reset_index(drop=True)
State Year Base_2007 Base_2011 County
0 DM 2013 34.0 45.0 Dummy
1 DM 2012 34.0 45.0 Dummy
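As a side note, the same result can also be reached with a groupby/filter, which some may find easier to read (an alternative sketch, not the approach explained above):
df_new = df.groupby('County').filter(
    lambda g: g[['Base_2007', 'Base_2011']].notna().all().all()
).reset_index(drop=True)
# keeps only those counties with no NaN anywhere in Base_2007 or Base_2011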