Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pandas groupby operation with missing data

In a pandas dataframe I have a column that looks like:

0         M
1         E
2         L
3       M.1
4       M.2
5       M.3
6       E.1
7       E.2
8       E.3
9       E.4
10      L.1
11      L.2
12    M.1.a
13    M.1.b
14    M.1.c
15    M.2.a
16    M.3.a
17    E.1.a
18    E.1.b
19    E.1.c
20    E.2.a
21    E.3.a
22    E.3.b
23    E.4.a

I need to group all the value where the first elements are E, M, or L and then, for each group, I need to create a subgroup where the index is 1, 2, or 3 which will contain a record for each lowercase letter (a,b,c, ...) Potentially the solution should work for any number of levels concatenate elements (in this case the number of levels is 3 (eg: A.1.a))

0    1    2
E    1    a
          b
          c
     2    a
     3    a
          b
     4    a
L    1
     2
M    1    a
          b
          c
     2    a
     3    a

I tried with:

df.groupby([0,1,2]).count()

But the result is missing the L level because it doesn't have records at the last sub-level

A workaround is to add a dummy variable and then remove it ... like:

df[2][(df[0]=='L') & (df[2].isnull()) & (df[1].notnull())]='x'
df = df.replace(np.nan,' ', regex=True)
df.sort_values(0, ascending=False, inplace=True)
newdf = df.groupby([0,1,2]).count()

which gives:

0    1    2
E    1    a
          b
          c
     2    a
     3    a
          b
     4    a
L    1    x
     2    x
M    1    a
          b
          c
     2    a
     3    a

I then deal with the dummy entry x later in my code ...

how can avoid this ackish way to use groupby ?

like image 757
epifanio Avatar asked Nov 09 '22 02:11

epifanio


1 Answers

Assuming the column under consideration to be represented by s, we can:

  1. Split on "." delimiter along with expand=True to produce an expanded DF.

  2. fnc : checks if all elements of the grouped frame consists of only None, then it replaces them by a dummy entry "" which is established via a list-comprehension. A series constructor is later called on the filtered list. Any None's present here are subsequently removed using dropna.

  3. Perform groupby w.r.t. 0 & 1 column names and apply fnc to 2.


split_str = s.str.split(".", expand=True)
fnc = lambda g: pd.Series(["" if all(x is None for x in g) else x for x in g]).dropna()
split_str.groupby([0, 1])[2].apply(fnc)

produces:

0  1   
E  1  1    a
      2    b
      3    c
   2  1    a
   3  1    a
      2    b
   4  1    a
L  1  0    
   2  0    
M  1  1    a
      2    b
      3    c
   2  1    a
   3  1    a
Name: 2, dtype: object

To obtain a flattened DF, reset the indices same as the levels used to group the DF before:

split_str.groupby([0, 1])[2].apply(fnc).reset_index(level=[0, 1]).reset_index(drop=True)

produces:

    0  1  2
0   E  1  a
1   E  1  b
2   E  1  c
3   E  2  a
4   E  3  a
5   E  3  b
6   E  4  a
7   L  1   
8   L  2   
9   M  1  a
10  M  1  b
11  M  1  c
12  M  2  a
13  M  3  a
like image 145
Nickil Maveli Avatar answered Nov 14 '22 21:11

Nickil Maveli