I am currently trying to figure out efficient way to split single panadas DataFrame rows into multiple, slightly changed rows. Imagine such structure:
A C1 C2 C3 C4
1 a b c a
2 b a e b a
3 g c
4 d e
and I want to end up with structure like that:
A C
1 a b
2 a c
3 a a
4 b a
5 b e
6 b b
7 b a
8 g c
9 d e
10 d e
So far I've been using for loops and create dictionaries like that (df is my DataFrame):
rows = []
for i, r in df.iterrows():
tmp = r[1:].dropna()
for c in tmp.values:
dict = {'A': r[0], 'C': c}
rows.append(dict)
Unfortunately this approach is extremly slow. So far after my work with pandas I see that when using only it execution time can be significantly improved, but I don't have so much experience to figure out how to make this case faster.
Can someone advice, what can be done to speed this up?
try this:
In [10]: pd.melt(df, id_vars='A', value_vars=['C1','C2','C3','C4'])
Out[10]:
A variable value
0 a C1 b
1 b C1 a
2 g C1 c
3 d C1 e
4 a C2 c
5 b C2 e
6 g C2 NaN
7 d C2 NaN
8 a C3 a
9 b C3 b
10 g C3 NaN
11 d C3 NaN
12 a C4 NaN
13 b C4 a
14 g C4 NaN
15 d C4 NaN
if you want to get rid of NaN's:
In [15]: pd.melt(df, id_vars='A', value_vars=['C1','C2','C3','C4'], value_name='C')[['A','C']].dropna()
Out[15]:
A C
0 a b
1 b a
2 g c
3 d e
4 a c
5 b e
8 a a
9 b b
13 b a
the same, but selecting C* columns dynamically:
In [21]: (pd.melt(df, id_vars='A',
....: value_vars=df.filter(like='C').columns.tolist(),
....: value_name='C')[['A','C']]
....: .dropna()
....: )
Out[21]:
A C
0 a b
1 b a
2 g c
3 d e
4 a c
5 b e
8 a a
9 b b
13 b a
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With