We have a DataFrame that looks like this: <pre class="prettyprint"><code>> df.ix[:2,:10] 0 1 2 3 4 5 6 7 8 9 10 0 NaN NaN NaN NaN 6 5 NaN NaN 4 NaN 5 1 NaN NaN NaN NaN 8 NaN NaN 7 NaN NaN 5 2 NaN NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN </code></pre> We simply want the counts of all unique values in the DataFrame. A simple solution is: <pre class="prettyprint"><code>df.stack().value_counts() </code></pre> However: 1. It looks like <code>stack</code> returns a copy, not a view, which is memory prohibitive in this case. Is this correct? 2. I want to group the DataFrame by rows, and then get the different histograms for each grouping. If we ignore the memory issues with <code>stack</code> and use it for now, how does one do the grouping correctly? <pre class="prettyprint"><code>d = pd.DataFrame([[nan, 1, nan, 2, 3], [nan, 1, 1, 1, 3], [nan, 1, nan, 2, 3], [nan,2,2,2, 3]]) len(d.stack()) #14 d.stack().groupby(arange(4)) AssertionError: Grouper and axis must be same length </code></pre> The stacked DataFrame has a MultiIndex, with a length of some number less than <code>n_rows*n_columns</code>, because the <code>nan</code>s are removed. <pre class="prettyprint"><code>0 1 1 3 2 4 3 1 0 1 1 1 2 1 3 1 4 3 .... </code></pre> This means we don't easily know how to build our grouping. It would be much better to just operate on the first level, but then I'm stuck on how to then apply the grouping I actually want. <pre class="prettyprint"><code>d.stack().groupby(level=0).groupby(list('aabb')) KeyError: 'a' </code></pre> Edit: A solution, which doesn't use stacking: <pre class="prettyprint"><code>f = lambda x: pd.value_counts(x.values.ravel()) d.groupby(list('aabb')).apply(f) a 1 4 3 2 2 1 b 2 4 3 2 1 1 dtype: int64 </code></pre> Looks clunky, though. If there's a better option I'm happy to hear it. Edit: Dan's comment revealed I had a typo, though correcting that still doesn't get us to the finish line.

I think you are doing a row/column-wise operation so can use <code>apply</code>: <pre class="prettyprint"><code>In [11]: d.apply(pd.Series.value_counts, axis=1).fillna(0) Out[11]: 1 2 3 0 1 1 1 1 4 0 1 2 1 1 1 3 0 4 1 </code></pre> Note: There is a <code>value_counts</code> DataFrame method in the works for 0.14... which will make this more efficient and more concise. It's worth noting that the pandas <code>value_counts</code> function also works on a numpy array, so you can pass it the values of the DataFrame (as a 1-d array view using <code>np.ravel</code>): <pre class="prettyprint"><code>In [21]: pd.value_counts(d.values.ravel()) Out[21]: 2 6 1 6 3 4 dtype: int64 </code></pre> Also, you were pretty close to getting this correct, but you'd need to stack and unstack: <pre class="prettyprint"><code>In [22]: d.stack().groupby(level=0).apply(pd.Series.value_counts).unstack().fillna(0) Out[22]: 1 2 3 0 1 1 1 1 4 0 1 2 1 1 1 3 0 4 1 </code></pre> This error seems somewhat self explanatory (4 != 16): <pre class="prettyprint"><code>len(d.stack()) #16 d.stack().groupby(arange(4)) AssertionError: Grouper and axis must be same length </code></pre> perhaps you wanted to pass: <pre class="prettyprint"><code>In [23]: np.repeat(np.arange(4), 4) Out[23]: array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]) </code></pre>

Not enough rep to comment, but Andy's answer: <pre class="prettyprint"><code>pd.value_counts(d.values.ravel()) </code></pre> is what I have used personally, and seems to me to be by far the most versatile and easily-readable solution. Another advantage is that it is easy to use a subset of the columns: <pre class="prettyprint"><code>pd.value_counts(d[[1,3,4,6,7]].values.ravel()) </code></pre> or <pre class="prettyprint"><code>pd.value_counts(d[["col_title1","col_title2"]].values.ravel()) </code></pre> Is there any disadvantage to this approach, or any particular reason you want to use stack and groupby?

Pandas: Counting unique values in a dataframe

We have a DataFrame that looks like this:

> df.ix[:2,:10]
    0   1   2   3   4   5   6   7   8   9   10
0   NaN NaN NaN NaN  6   5  NaN NaN  4  NaN  5
1   NaN NaN NaN NaN  8  NaN NaN  7  NaN NaN  5
2   NaN NaN NaN NaN NaN  1  NaN NaN NaN NaN NaN

We simply want the counts of all unique values in the DataFrame. A simple solution is:

df.stack().value_counts()

However: 1. It looks like stack returns a copy, not a view, which is memory prohibitive in this case. Is this correct? 2. I want to group the DataFrame by rows, and then get the different histograms for each grouping. If we ignore the memory issues with stack and use it for now, how does one do the grouping correctly?

d = pd.DataFrame([[nan, 1, nan, 2, 3],
              [nan, 1, 1, 1, 3],
              [nan, 1, nan, 2, 3],
              [nan,2,2,2, 3]])

len(d.stack()) #14
d.stack().groupby(arange(4))
AssertionError: Grouper and axis must be same length

The stacked DataFrame has a MultiIndex, with a length of some number less than n_rows*n_columns, because the nans are removed.

This means we don't easily know how to build our grouping. It would be much better to just operate on the first level, but then I'm stuck on how to then apply the grouping I actually want.

d.stack().groupby(level=0).groupby(list('aabb'))
KeyError: 'a'

Edit: A solution, which doesn't use stacking:

f = lambda x: pd.value_counts(x.values.ravel())
d.groupby(list('aabb')).apply(f)
a  1    4
   3    2
   2    1
b  2    4
   3    2
   1    1
dtype: int64

Looks clunky, though. If there's a better option I'm happy to hear it.

Edit: Dan's comment revealed I had a typo, though correcting that still doesn't get us to the finish line.

How do I find unique values in a DataFrame column in Python?

You can get unique values in column (multiple columns) from pandas DataFrame using unique() or Series. unique() functions. unique() from Series is used to get unique values from a single column and the other one is used to get from multiple columns.

I think you are doing a row/column-wise operation so can use apply:

In [11]: d.apply(pd.Series.value_counts, axis=1).fillna(0)
Out[11]: 
   1  2  3
0  1  1  1
1  4  0  1
2  1  1  1
3  0  4  1

Note: There is a value_counts DataFrame method in the works for 0.14... which will make this more efficient and more concise.

It's worth noting that the pandas value_counts function also works on a numpy array, so you can pass it the values of the DataFrame (as a 1-d array view using np.ravel):

In [21]: pd.value_counts(d.values.ravel())
Out[21]: 
2    6
1    6
3    4
dtype: int64

Also, you were pretty close to getting this correct, but you'd need to stack and unstack:

In [22]: d.stack().groupby(level=0).apply(pd.Series.value_counts).unstack().fillna(0)
Out[22]: 
   1  2  3
0  1  1  1
1  4  0  1
2  1  1  1
3  0  4  1

This error seems somewhat self explanatory (4 != 16):

len(d.stack()) #16
d.stack().groupby(arange(4))
AssertionError: Grouper and axis must be same length

perhaps you wanted to pass:

In [23]: np.repeat(np.arange(4), 4)
Out[23]: array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])

Not enough rep to comment, but Andy's answer:

pd.value_counts(d.values.ravel())

is what I have used personally, and seems to me to be by far the most versatile and easily-readable solution. Another advantage is that it is easy to use a subset of the columns:

pd.value_counts(d[[1,3,4,6,7]].values.ravel())

or

pd.value_counts(d[["col_title1","col_title2"]].values.ravel())

Is there any disadvantage to this approach, or any particular reason you want to use stack and groupby?

Pandas: Counting unique values in a dataframe

Tags:

python

pandas

jeffalstott

People also ask

2 Answers

Andy Hayden

tegan

Recent Activity

Donate For Us

Pandas: Counting unique values in a dataframe

Tags:

python

pandas

jeffalstott

People also ask

2 Answers

Andy Hayden

tegan

Related questions

Recent Activity

Donate For Us