
Pandas - Get unique values from column along with lists of row indices where they appear

Tags: python, pandas

My dataframe has a string column that can contain long strings. I want to get a list of unique strings, and also a list for each unique string containing row indices where it appears.

I can think of two ways of doing this.

  1. First get the unique values using .unique(), then iterate over the dataframe to build up a list of row indices for each unique value.
  2. Use .groupby() to create groups and get the list of row indices in each group.

But I am not sure which one is more efficient, or whether there is a more efficient way altogether. I am concerned about efficiency because the column I want to deduplicate and group by is a string column that may contain long strings.
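
For reference, the two approaches described above can be sketched as follows. This is a minimal illustration, not a benchmark: the sample data and the variable names `by_unique` and `by_groupby` are assumptions made for the example. Note that the first approach scans the whole column once per unique value, while the second builds all groups in a single groupby call.

```python
import pandas as pd

# Hypothetical sample data standing in for the real string column.
df = pd.DataFrame({"col": ["aaaa", "bbbb", "aaaa", "aaaa", "bbbb", "cccc"]})

# Approach 1: unique() first, then one boolean comparison over the
# whole column for every unique value.
by_unique = {
    value: df.index[df["col"] == value].tolist()
    for value in df["col"].unique()
}

# Approach 2: one groupby call; .groups maps each unique value to the
# index labels of the rows in which it appears.
by_groupby = {
    value: labels.tolist()
    for value, labels in df.groupby("col").groups.items()
}

print(by_unique)
print(by_groupby)
```

Both dictionaries hold the same mapping from string to row labels, so timing each approach on a realistic sample of the actual data is probably the most reliable way to compare them.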

Thanks!

shikhanshu asked Sep 13 '17 18:09



1 Answer

Demo:

In [16]: df
Out[16]:
    col
0  aaaa
1  bbbb
2  aaaa
3  aaaa
4  bbbb
5  cccc

In [17]: df.groupby('col').groups
Out[17]:
{'aaaa': Int64Index([0, 2, 3], dtype='int64'),
 'bbbb': Int64Index([1, 4], dtype='int64'),
 'cccc': Int64Index([5], dtype='int64')}

or as a DataFrame:

In [31]: pd.DataFrame([[k,v.values]
                        for k,v in df.groupby('col').groups.items()], 
                      columns=['col','indices'])
Out[31]:
    col    indices
0  aaaa  [0, 2, 3]
1  bbbb     [1, 4]
2  cccc        [5]
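
A possible variation, shown here as a self-contained script rather than as part of the original session: grouping the index itself and aggregating with `list` yields plain Python lists instead of index objects. The sample data and the names `index_lists` and `result` are assumptions made for this sketch.

```python
import pandas as pd

# Same sample data as in the demo above.
df = pd.DataFrame({"col": ["aaaa", "bbbb", "aaaa", "aaaa", "bbbb", "cccc"]})

# Turn the row labels into a Series, group them by the string column,
# and collect each group's labels into a plain Python list.
index_lists = df.index.to_series().groupby(df["col"]).agg(list)

# Move the group keys back into a regular column.
result = index_lists.reset_index(name="indices")

print(result)
```

The unique strings are then available as `result["col"]` and the matching row labels as `result["indices"]`. On recent pandas versions, the `.groups` output shown earlier prints `Index([...], dtype='int64')` rather than `Int64Index`, since `Int64Index` was removed in pandas 2.0.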

MaxU - stop WAR against UA answered Oct 05 '22 16:10