Efficient way to get group names in pandas

I have a .csv file with around 300,000 rows. I group it by a particular column, which gives 2138 groups of around 140 rows each.

I am trying to generate a numpy array of the group names. I currently build it with a for loop, but that takes a while to run.

import numpy as np
import pandas as pd

df = pd.read_csv('file.csv')
grouped = df.groupby('col1')
group_names = []
for name, group in grouped:
    group_names.append(name)
group_names = np.array(group_names, dtype=object)

Is there a more efficient way to do this, either with a pandas method or by converting the names directly into a numpy array?

swopnil asked Jun 14 '18 14:06


2 Answers

The fastest way is most likely to call unique on the column you are grouping by. It returns an array of the distinct values in that column, in order of first appearance, and those values are your group names.

group_names = df.col1.unique()
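As a rough sketch of how this compares with the loop in the question, the snippet below uses a small made-up DataFrame (the data and the `val` column are hypothetical stand-ins for `file.csv`). It shows one difference worth knowing: iterating a groupby yields keys in sorted order by default, while `unique` keeps the order of first appearance, so sort the result if the order matters.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for file.csv
df = pd.DataFrame({'col1': ['b', 'a', 'b', 'c', 'a'], 'val': range(5)})

# The loop from the question, written as a comprehension.
# groupby sorts the group keys by default.
loop_names = np.array([name for name, _ in df.groupby('col1')], dtype=object)

# unique() keeps the order of first appearance.
# np.asarray makes sure the result is a plain NumPy object array
# regardless of the column's dtype.
unique_names = np.asarray(df['col1'].unique(), dtype=object)

# Sorting the unique values reproduces the groupby order.
sorted_names = np.sort(unique_names)

print(loop_names)
print(unique_names)
print(sorted_names)
```

If the grouping column contains missing values, the two approaches can also differ, since groupby drops NaN keys by default while `unique` keeps them.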
sacuL answered Sep 28 '22 00:09


groupby objects have a .groups attribute:

groups = df.groupby('col1').groups

This returns a dict that maps each group name to the index labels of the rows in that group.

Example:

In[257]:
df = pd.DataFrame({'a':list('aabcccc'), 'b':np.random.randn(7)})
groups = df.groupby('a').groups
groups

Out[257]: 
{'a': Int64Index([0, 1], dtype='int64'),
 'b': Int64Index([2], dtype='int64'),
 'c': Int64Index([3, 4, 5, 6], dtype='int64')}

In[258]:
groups.keys()
Out[258]: dict_keys(['a', 'b', 'c'])
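Since the question asks for a numpy array, here is a minimal sketch of one way to turn those keys into one. It reuses the example DataFrame above; the name `group_names` is taken from the question, and the rest is illustrative rather than the only way to do it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': list('aabcccc'), 'b': np.random.randn(7)})
groups = df.groupby('a').groups

# groups.keys() is a dict view, not a sequence. Passing it straight to
# np.array would give a 0-d object array wrapping the view, so
# materialise it as a list first.
group_names = np.array(list(groups.keys()), dtype=object)

print(group_names)
```

Note that .groups also builds the row labels for every group, so if only the names are needed this likely does more work than necessary.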
EdChum answered Sep 28 '22 00:09