 

Retaining NaN values after get_dummies in Pandas

I have a dataframe 'df' like this -

Id    v1    v2
0     A     0.23
1     B     0.65
2     NaN   0.87

If I use

df1 = pd.get_dummies(df)
df1

I get

Id    v1_A    v1_B    v2
0     1       0       0.23
1     0       1       0.65
2     0       0       0.87

How can I get the following efficiently?

Id    v1_A    v1_B    v2
0     1       0       0.23
1     0       1       0.65
2     NaN     NaN     0.87

I was using this initially, but it takes too long:

import numpy as np
import pandas as pd

dfv1 = df[['v1']]                   # slice out the v1 column
dfs = pd.get_dummies(dfv1)
dfsum = dfs.apply(np.sum, axis=1)   # row-by-row sum of dfs
for i in range(len(dfs)):           # iterate over every row
    if dfsum.iloc[i] == 0:          # an all-zero row means v1 was NaN,
        dfs.iloc[i, :] = np.nan     # so overwrite that row with NaN
del df['v1']                        # delete the original column
df = pd.concat([df, dfs], axis=1)   # append the dummy columns

I am using Python 3.5.1 on Jupyter, and pandas 0.18. Thanks.

asked Apr 15 '16 by Gaurav Waghmare

2 Answers

I'll use a simple data frame as an example:

import numpy as np
import pandas as pd

df1 = pd.DataFrame([['A', 'A'], [np.nan, 'B'], ['C', np.nan]])

>>> df1
     0    1
0    A    A
1  NaN    B
2    C  NaN

Then one-hot-encode it:

df1_ohe = pd.get_dummies(df1, dummy_na=True)

>>> df1_ohe
   0_A  0_C  0_nan  1_A  1_B  1_nan
0    1    0      0    1    0      0
1    0    0      1    0    1      0
2    0    1      0    0    0      1

Now get a subset of this data frame, containing only the NaN columns:

nan_df = df1_ohe.loc[:, df1_ohe.columns.str.endswith("_nan")]

>>> nan_df
   0_nan  1_nan
0      0      0
1      1      0
2      0      1

Finally, iterate over each row of the data frame and each NaN column, with a little help from a regex.

If position [row, NaN column] contains a 1, the corresponding cell in the original (pre-OHE) data frame was NaN.

The regex extracts the original column id "col_id" from the dummy name (e.g., 1_nan gives me 1, the column that contained the NaN in the non-OHE data frame).

So I target all dummy columns derived from that original column (i.e., 1_A, 1_B and 1_nan) and replace their values in that row with NaN.

import re

pattern = "^([^_]*)_"    # captures everything before the first underscore
regex = re.compile(pattern)

for index in df1_ohe.index:
    for col_nan in nan_df.columns:
        if df1_ohe.loc[index,col_nan] == 1:
            col_id = regex.search(col_nan).group(1)
            targets = df1_ohe.columns[df1_ohe.columns.str.startswith(col_id+'_')]
            df1_ohe.loc[index, targets] = np.nan

Giving me:

>>> df1_ohe
   0_A  0_C  0_nan  1_A  1_B  1_nan
0  1.0  0.0    0.0  1.0  0.0    0.0
1  NaN  NaN    NaN  0.0  1.0    0.0
2  0.0  1.0    0.0  NaN  NaN    NaN

Finally, I remove the NaN columns from the OHE data frame:

df1_ohe.drop(df1_ohe.columns[df1_ohe.columns.str.endswith('_nan')], axis=1, inplace=True)


>>> df1_ohe
   0_A  0_C  1_A  1_B
0  1.0  0.0  1.0  0.0
1  NaN  NaN  0.0  1.0
2  0.0  1.0  NaN  NaN
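The nested row-by-row loop above can also be collapsed into a single pass over the original columns, masking each group of dummy columns at once. A sketch of that idea, assuming the original column names contain no underscores of their own (the same example data as above):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([['A', 'A'], [np.nan, 'B'], ['C', np.nan]])
df1_ohe = pd.get_dummies(df1, dummy_na=True)

for col in df1.columns:
    prefix = str(col) + "_"
    # all dummy columns that came from this original column
    group = df1_ohe.columns[df1_ohe.columns.str.startswith(prefix)]
    # rows where the *_nan indicator is set had NaN in the original column
    df1_ohe.loc[df1_ohe[prefix + "nan"] == 1, group] = np.nan

# drop the helper *_nan columns
df1_ohe.drop(df1_ohe.columns[df1_ohe.columns.str.endswith("_nan")],
             axis=1, inplace=True)
```

This produces the same result as the loop version but only iterates once per original column rather than once per cell.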
answered Sep 28 '22 by Álvaro Salgado


Method #1 would be to use v1's NaNs directly, without loops:

>>> df1 = pd.get_dummies(df)
>>> df1.loc[df.v1.isnull(), df1.columns.str.startswith("v1_")] = np.nan
>>> df1
   Id    v2  v1_A  v1_B
0   0  0.23   1.0   0.0
1   1  0.65   0.0   1.0
2   2  0.87   NaN   NaN

Method #2 would be to use the dummy_na argument to get us a column we could use:

>>> df1 = pd.get_dummies(df, dummy_na=True)
>>> df1
   Id    v2  v1_A  v1_B  v1_nan
0   0  0.23   1.0   0.0     0.0
1   1  0.65   0.0   1.0     0.0
2   2  0.87   0.0   0.0     1.0
>>> df1.loc[df1.v1_nan == 1, ["v1_A", "v1_B"]] = np.nan
>>> del df1["v1_nan"]
>>> df1
   Id    v2  v1_A  v1_B
0   0  0.23   1.0   0.0
1   1  0.65   0.0   1.0
2   2  0.87   NaN   NaN
answered Sep 28 '22 by DSM