I have a dataframe (pan[xyz], a slice of a panel) containing only the values 1, 0 and -1. The dimensions are: rows 0:10 and columns 'a':'j'.
I would like to create another dataframe (df) which has the same row index, but only 3 columns:
col_1 = count of all non-zero values (1s and -1s)
col_2 = count of all 1s
col_3 = count of all -1s
I found this in searching SO:
df['col_1'] = (pan[xyz][['a','b','c','d','e']] > 0).count(axis=1)
...and have tried many different variations, but I cannot get the conditional (> 0) to distinguish between the different values in pan[xyz]; the count is always 5.
Any help would be much appreciated.
Edit:
pan[xyz] =
.  'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j'
0    1   0   0  -1   0   0  -1   0   1   0
1    0   1   0   0   0   1   0   0   0  -1
2    1   0   0   0   0  -1   0   0   0   0
3    0  -1   0   0   0   0   0   1   0   0
4    0   0   0   1   0   0  -1   0   0  -1
df should be =
.  col_1  col_2  col_3
0      4      2      2
1      3      2      1
2      2      1      1
3      2      1      1
4      3      1      2
But this is what I get for col_1:
df = (panel[xyz] > 0).count(axis=1)
df
Out[129]:
0 10
1 10
2 10
3 10
4 10
dtype: int64
I'm just doing this with a flat dataframe, but it's the same for a panel. You can do it one of two ways. The first way is what you did; just change the count() to sum():
( df > 0 ).sum(axis=1)
The comparison produces a dataframe of booleans, and count() counts every non-null cell, so True and False both get counted. If you sum() them instead, True is treated as 1 and False as 0, which is what you were expecting.
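A minimal sketch of the difference, using a small made-up frame in place of pan[xyz]:

```python
import pandas as pd

# Hypothetical miniature of the question's data.
pan_xyz = pd.DataFrame({'a': [1, 0], 'b': [0, 1], 'c': [-1, 0]})

mask = pan_xyz > 0           # boolean dataframe: True/False in every cell
print(mask.count(axis=1))    # counts every non-null cell (True AND False) -> 3, 3
print(mask.sum(axis=1))      # True counts as 1, False as 0 -> 1, 1
```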
But a more standard way to do it would be like this:
df[ df > 0 ].count(axis=1)
While the former method was based on a dataframe of booleans, the latter looks like this:
df[ df > 0 ]
a b c d e f g h i j
0 1 NaN NaN NaN NaN NaN NaN NaN 1 NaN
1 NaN 1 NaN NaN NaN 1 NaN NaN NaN NaN
2 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN 1 NaN NaN
4 NaN NaN NaN 1 NaN NaN NaN NaN NaN NaN
In this case it doesn't really matter which method you use, but in general the latter is going to be better, because you can do more with it. For example, with the former method (which has binary outcomes by design), all you can really do is count, but in the latter method you can count, sum, multiply, etc.
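For instance, with the masked frame you can both count the matching cells and sum their actual values in one pass (values here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [2, 0], 'b': [0, 3], 'c': [5, -1]})

masked = df[df > 0]          # non-matching cells become NaN
print(masked.count(axis=1))  # how many positive values per row -> 2, 1
print(masked.sum(axis=1))    # total of the positive values (NaN is skipped) -> 7, 3
```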
The potential usefulness of this may be more obvious for the case of df != 0, where there are more than two possible values:
df[ df != 0 ]
a b c d e f g h i j
0 1 NaN NaN -1 NaN NaN -1 NaN 1 NaN
1 NaN 1 NaN NaN NaN 1 NaN NaN NaN -1
2 1 NaN NaN NaN NaN -1 NaN NaN NaN NaN
3 NaN -1 NaN NaN NaN NaN NaN 1 NaN NaN
4 NaN NaN NaN 1 NaN NaN -1 NaN NaN -1
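Putting it together for the three columns the question asks for — a sketch following the sum-over-booleans approach, with pan_xyz standing in for however the panel slice is accessed:

```python
import pandas as pd

# Rebuild the example frame from the question.
data = [
    [1, 0, 0, -1, 0, 0, -1, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0, 0, 0, -1],
    [1, 0, 0, 0, 0, -1, 0, 0, 0, 0],
    [0, -1, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0, -1, 0, 0, -1],
]
pan_xyz = pd.DataFrame(data, columns=list('abcdefghij'))

df = pd.DataFrame({
    'col_1': (pan_xyz != 0).sum(axis=1),   # count of non-zero values
    'col_2': (pan_xyz == 1).sum(axis=1),   # count of 1s
    'col_3': (pan_xyz == -1).sum(axis=1),  # count of -1s
})
print(df)
```

This reproduces the expected output in the question (e.g. row 0 has 4 non-zeros, two 1s and two -1s).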