In a pandas dataframe I have a column that looks like:
0 M
1 E
2 L
3 M.1
4 M.2
5 M.3
6 E.1
7 E.2
8 E.3
9 E.4
10 L.1
11 L.2
12 M.1.a
13 M.1.b
14 M.1.c
15 M.2.a
16 M.3.a
17 E.1.a
18 E.1.b
19 E.1.c
20 E.2.a
21 E.3.a
22 E.3.b
23 E.4.a
I need to group all the value where the first elements are E, M, or L
and then, for each group, I need to create a subgroup where the index is 1, 2, or 3
which will contain a record for each lowercase letter
(a,b,c, ...)
Potentially the solution should work for any number of levels concatenate elements (in this case the number of levels is 3 (eg: A.1.a))
0 1 2
E 1 a
b
c
2 a
3 a
b
4 a
L 1
2
M 1 a
b
c
2 a
3 a
I tried with:
df.groupby([0,1,2]).count()
But the result is missing the L level because it doesn't have records at the last sub-level
A workaround is to add a dummy variable and then remove it ... like:
df[2][(df[0]=='L') & (df[2].isnull()) & (df[1].notnull())]='x'
df = df.replace(np.nan,' ', regex=True)
df.sort_values(0, ascending=False, inplace=True)
newdf = df.groupby([0,1,2]).count()
which gives:
0 1 2
E 1 a
b
c
2 a
3 a
b
4 a
L 1 x
2 x
M 1 a
b
c
2 a
3 a
I then deal with the dummy
entry x
later in my code ...
how can avoid this ackish way to use groupby
?
Assuming the column under consideration to be represented by s
, we can:
Split on "."
delimiter along with expand=True
to produce an expanded DF
.
fnc
: checks if all elements of the grouped frame consists of only None
, then it replaces them by a dummy entry ""
which is established via a list-comprehension. A series constructor is later called on the filtered list. Any None
's present here are subsequently removed using dropna
.
Perform groupby
w.r.t. 0 & 1 column names and apply fnc
to 2.
split_str = s.str.split(".", expand=True)
fnc = lambda g: pd.Series(["" if all(x is None for x in g) else x for x in g]).dropna()
split_str.groupby([0, 1])[2].apply(fnc)
produces:
0 1
E 1 1 a
2 b
3 c
2 1 a
3 1 a
2 b
4 1 a
L 1 0
2 0
M 1 1 a
2 b
3 c
2 1 a
3 1 a
Name: 2, dtype: object
To obtain a flattened DF
, reset the indices same as the levels used to group the DF
before:
split_str.groupby([0, 1])[2].apply(fnc).reset_index(level=[0, 1]).reset_index(drop=True)
produces:
0 1 2
0 E 1 a
1 E 1 b
2 E 1 c
3 E 2 a
4 E 3 a
5 E 3 b
6 E 4 a
7 L 1
8 L 2
9 M 1 a
10 M 1 b
11 M 1 c
12 M 2 a
13 M 3 a
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With