Is it possible to convert a df into a matrix like the following? Given df:
Name Value
x 5
x 2
x 3
x 3
y 3
y 2
z 4
The matrix would be:
Name 1 2 3 4 5
x 4 4 3 1 1
y 2 2 1 0 0
z 1 1 1 1 0
Here's the logic behind it:
Name 1 2 3 4 5 (5 columns since 5 is the max in Value)
--------------------------------------------------------------------
x 4 (since x has 4 values >= 1) 4 (since x has 4 values >= 2) 3 (since x has 3 values >= 3) 1 (since x has 1 value >= 4) 1 (since x has 1 value >= 5)
y 2 (since y has 2 values >= 1) 2 (since y has 2 values >= 2) 1 (since y has 1 value >= 3) 0 (since no y >= 4) 0 (since no y >= 5)
z 1 (since z has 1 value >= 1) 1 (since z has 1 value >= 2) 1 (since z has 1 value >= 3) 1 (since z has 1 value >= 4) 0 (since no z >= 5)
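The logic above can be written down almost literally. This is only a minimal sketch for checking the expected matrix, assuming the sample data from the question and that Value holds positive integers; the variable names `thresholds` and `naive` are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["x", "x", "x", "x", "y", "y", "z"],
    "Value": [5, 2, 3, 3, 3, 2, 4],
})

# One column per threshold t = 1..max(Value); each cell counts the rows
# of that Name whose Value is >= t.
thresholds = range(1, int(df["Value"].max()) + 1)
naive = pd.DataFrame({
    t: (df["Value"] >= t).groupby(df["Name"]).sum()
    for t in thresholds
}).astype(int)

print(naive)
```

It loops over the thresholds in Python, so it is meant as a readable reference rather than a fast solution.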
Let me know if this makes sense.
I know I have to use sort, group, and count but couldn't figure out how to set up the matrix.
Thank you!!!
Probably the fastest solution, using numpy's broadcasting -
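import numpy as np
import pandas as pd
i = np.arange(1, df.Value.max() + 1)
j = df.Value.values[:, None] >= i
# .sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
df = pd.DataFrame(j, columns=i, index=df.Name).groupby(level=0).sum()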
1 2 3 4 5
Name
x 4.0 4.0 3.0 1.0 1.0
y 2.0 2.0 1.0 0.0 0.0
z 1.0 1.0 1.0 1.0 0.0
Caveat: In exchange for performance, this is a memory-hungry method: the intermediate boolean array has len(df) * df.Value.max() elements. For large data, it may result in a memory blowout, so use with discretion.
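If memory is the concern, one possible alternative (a different technique from the broadcasting above, not part of this answer) is to count exact values per group first and then take a reversed cumulative sum. The sketch below assumes Value contains integers >= 1; the names `counts`, `at_least` and `result` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["x", "x", "x", "x", "y", "y", "z"],
    "Value": [5, 2, 3, 3, 3, 2, 4],
})

# Integer code per row (0, 1, 2, ...) plus the sorted unique names.
codes, names = pd.factorize(df["Name"], sort=True)
max_value = int(df["Value"].max())

# counts[g, v - 1] = number of rows in group g whose Value equals v.
# Assumes every Value is an integer >= 1.
counts = np.zeros((len(names), max_value), dtype=np.int64)
np.add.at(counts, (codes, df["Value"].to_numpy() - 1), 1)

# A reversed cumulative sum turns "equal to v" counts into ">= v" counts.
at_least = counts[:, ::-1].cumsum(axis=1)[:, ::-1]

result = pd.DataFrame(at_least, index=pd.Index(names, name="Name"),
                      columns=np.arange(1, max_value + 1))
print(result)
```

Here the largest intermediate array has one row per unique Name rather than one row per row of df.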
Details
Create a range of values, from 1 to df.Value.max() -
i = np.arange(1, df.Value.max() + 1)
i
array([1, 2, 3, 4, 5])
Perform a broadcasted comparison between df.Value and i -
j = df.Value.values[:, None] >= i
j
array([[ True, True, True, True, True],
[ True, True, False, False, False],
[ True, True, True, False, False],
[ True, True, True, False, False],
[ True, True, True, False, False],
[ True, True, False, False, False],
[ True, True, True, True, False]], dtype=bool)
Load this into a dataframe, and perform a grouped sum by df.Name to get your final result.
k = pd.DataFrame(j, columns=i, index=df.Name)
k
1 2 3 4 5
Name
x True True True True True
x True True False False False
x True True True False False
x True True True False False
y True True True False False
y True True False False False
z True True True True False
k.groupby(level=0).sum()  # replaces k.sum(level=0), removed in pandas 2.0
1 2 3 4 5
Name
x 4.0 4.0 3.0 1.0 1.0
y 2.0 2.0 1.0 0.0 0.0
z 1.0 1.0 1.0 1.0 0.0
If you need to convert the result to integers, call .astype(int) -
k.groupby(level=0).sum().astype(int)
1 2 3 4 5
Name
x 4 4 3 1 1
y 2 2 1 0 0
z 1 1 1 1 0
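For convenience, the steps above could be collected into one self-contained script. This is just a sketch: the function name `threshold_counts` and its `name_col` / `value_col` parameters are made up here, and it uses groupby(level=0).sum() for the grouped sum.

```python
import numpy as np
import pandas as pd


def threshold_counts(df, name_col="Name", value_col="Value"):
    """Count, per name, how many values are >= each threshold 1..max."""
    i = np.arange(1, df[value_col].max() + 1)
    # Broadcasted comparison: (n_rows, 1) against (max,) gives (n_rows, max).
    j = df[value_col].to_numpy()[:, None] >= i
    k = pd.DataFrame(j, columns=i, index=df[name_col])
    return k.groupby(level=0).sum().astype(int)


# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["x", "x", "x", "x", "y", "y", "z"],
    "Value": [5, 2, 3, 3, 3, 2, 4],
})

out = threshold_counts(df)
print(out)
```

Unlike the one-liner at the top, this leaves df itself untouched and returns the matrix as a new dataframe.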
This isn't the prettiest, but should work:
d2 = df.pivot_table(index="Name", columns="Value", aggfunc=len)
d2 = d2.reindex(range(1, df["Value"].max()+1), axis=1).fillna(0)
d2 = d2.iloc[:, ::-1].cumsum(axis=1).iloc[:, ::-1]
gives me
In [115]: d2
Out[115]:
Value 1 2 3 4 5
Name
x 4.0 4.0 3.0 1.0 1.0
y 2.0 2.0 1.0 0.0 0.0
z 1.0 1.0 1.0 1.0 0.0
where the repeated .iloc[:, ::-1] is just to get the cumulative sum to occur right-to-left.
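To see why the direction matters, here is a small sketch on a single row, using plain numpy instead of the dataframe. The array `exact` is written out by hand from x's values in the question (5, 2, 3, 3); the variable names are illustrative.

```python
import numpy as np

# How many of x's values equal 1, 2, 3, 4, 5 (x has values 5, 2, 3, 3).
exact = np.array([0, 1, 2, 0, 1])

# Ordinary cumsum runs left-to-right: counts of values <= t.
left_to_right = exact.cumsum()

# Reverse, cumsum, reverse again: counts of values >= t.
right_to_left = exact[::-1].cumsum()[::-1]

print(left_to_right)
print(right_to_left)
```

The right-to-left version reproduces the x row of the desired matrix, which is what the double .iloc[:, ::-1] achieves column-wise on d2.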