I have a dataframe that looks something like this:
d = {'Col_1' : pd.Series(['A', 'A', 'A', 'B']),
'Col_2' : pd.Series(['B', 'C', 'B', 'D']),
'Col_3' : pd.Series([np.nan, 'D', 'C', np.nan]),
'Col_4' : pd.Series([np.nan, np.nan, 'D', np.nan]),
'Col_5' : pd.Series([np.nan, np.nan, 'E', np.nan]),}
df = pd.DataFrame(d)
Col_1 Col_2 Col_3 Col_4 Col_5
A B NaN NaN NaN
A C D NaN NaN
A B C D E
B D NaN NaN NaN
My Goal is to end up with something along the lines of:
Col_1 Col_2 Col_3 Col_4 Col_5 ConCat
A B NaN NaN NaN A:B
A C D NaN NaN A:C:D
A B C D E A:B:C:D:E
B D NaN NaN NaN B:D
I've successfully created a dataframe that looks like the desired output from:
rows = df.values
df_1 = pd.DataFrame([':'.join(word for word in rows if word is not np.nan) for rows in rows])
0
0 A:B
1 A:C:D
2 A:B:C:D:E
3 B:D
But now when I attempt to place it into the original dataframe, I get:
df['concatenated'] = df_1
Col_1 Col_2 Col_3 Col_4 Col_5 concatenated
A B NaN NaN NaN NaN
A C D NaN NaN NaN
A B C D E NaN
B D NaN NaN NaN NaN
What's strange is that when creating a simplified example, it works as expected. Below if the full code of what I'm doing. The original data comes to me transposed from what the original dataframe above looks like.
df_caregiver_type = pd.concat([df_caregiver_type[col].order().reset_index(drop=True) for col in df_caregiver_type], axis=1, ignore_index=False).T
df_caregiver_type.rename(columns=lambda x: 'Col_' + str(x), inplace=True)
rows = df_caregiver_type.values
df_caregiver_type1 = pd.DataFrame([':'.join(word for word in rows if word is not np.nan) for rows in rows])
df_caregiver_type['concatenated'] = df_caregiver_type1
df_caregiver_type = df_caregiver_type.T
df_caregiver_type
Update I'm thinking I'm getting an error due to the first row of the full code. It's from a separate, but related question: pandas: sort each column individually
For your full dataset, change the last step from df['concatenated'] = df_1
to df['concatenated'] = df_1.values
will solve the issue, I think it a bug and I am very sure I have seen it in SO before.
Or just: df['concatenated'] = [':'.join(word for word in row if word is not np.nan) for row in rows]
>>> d = {'Col_1' : pd.Series(['A', 'A', 'A', 'B']),
... 'Col_2' : pd.Series(['B', 'C', 'B', 'D']),
... 'Col_3' : pd.Series([np.nan, 'D', 'C', np.nan]),
... 'Col_4' : pd.Series([np.nan, np.nan, 'D', np.nan]),
... 'Col_5' : pd.Series([np.nan, np.nan, 'E', np.nan]),}
>>> df = pd.DataFrame(d)
>>>
>>> rows = df.values
>>> df_1 = pd.DataFrame([':'.join(word for word in rows if word is not np.nan) for rows in rows])
>>>
>>> df['concatenated'] = df_1[0]
>>> df
Col_1 Col_2 Col_3 Col_4 Col_5 concatenated
0 A B NaN NaN NaN A:B
1 A C D NaN NaN A:C:D
2 A B C D E A:B:C:D:E
3 B D NaN NaN NaN B:D
>>>
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With