Having the following data frame, group A have 4 samples, B 3 samples and C 1 sample:
group data_1 data_2
0 A 1 4
1 A 2 5
2 A 3 6
3 A 4 7
4 B 1 4
5 B 2 5
6 B 3 6
7 C 1 4
I would like to transform the data into numpy array, where each row is a group with all its samples and zero padding for groups that have fewer samples.
Resulting in an array like so:
[
[[1,4],[2,5],[3,6],[4,7]], # this is A group 4 samples
[[1,4],[2,5],[3,6],[0,0]], # this is B group 3 samples
[[1,4],[0,0],[0,0],[0,0]], # this is C group 1 sample
]
This data structure can be converted to NumPy ndarray with the help of the DataFrame. to_numpy() method.
You call .groupby() and pass the name of the column that you want to group on, which is "state" . Then, you use ["last_name"] to specify the columns on which you want to perform the actual aggregation. You can pass a lot more than just a single column name to .groupby() as the first argument.
First is necessary add missing values - first solution with unstack and stack, counter Series is created by cumcount.
Second solution use reindex by MultiIndex.
Last use lambda function with groupby, convert to numpy array by values and last to lists:
g = df.groupby('group').cumcount()
L = (df.set_index(['group',g])
.unstack(fill_value=0)
.stack().groupby(level=0)
.apply(lambda x: x.values.tolist())
.tolist())
print (L)
[[[1, 4], [2, 5], [3, 6], [4, 7]],
[[1, 4], [2, 5], [3, 6], [0, 0]],
[[1, 4], [0, 0], [0, 0], [0, 0]]]
Another solution:
g = df.groupby('group').cumcount()
mux = pd.MultiIndex.from_product([df['group'].unique(), g.unique()])
L = (df.set_index(['group',g])
.reindex(mux, fill_value=0)
.groupby(level=0)['data_1','data_2']
.apply(lambda x: x.values.tolist())
.tolist()
)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With