I have a .csv file with around 300,000 rows. I group it by a particular column; each group has around 140 members (2,138 groups in total).
I am trying to generate a numpy array of the group names. So far I have used a for loop to collect the names, but it takes a while to run.
import numpy as np
import pandas as pd
df = pd.read_csv('file.csv')
grouped = df.groupby('col1')
group_names = []
for name, group in grouped:
    group_names.append(name)
group_names = np.array(group_names, dtype=object)
Is there a more efficient way to do this, either with a built-in pandas method or by converting the names directly into a numpy array?
The fastest way is most likely to call unique() on the column you are grouping by, which returns all of its distinct values. The output is an array of your group names, in order of first appearance rather than the sorted order that groupby produces.
group_names = df.col1.unique()
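As a minimal sketch of how this compares with the loop in the question: the small DataFrame below is a hypothetical stand-in for the CSV data, and the np.asarray call is an assumption added only to guarantee a plain object array if the column uses a pandas extension dtype. Sorting is needed only if you want the same order the groupby loop yields.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pd.read_csv('file.csv'); 'col1' is the grouping column.
df = pd.DataFrame({'col1': ['b', 'a', 'b', 'c', 'a'], 'val': range(5)})

# unique() returns the distinct values in order of first appearance.
# np.asarray(..., dtype=object) leaves an object ndarray unchanged and
# converts an extension array into a plain NumPy object array.
group_names = np.asarray(df['col1'].unique(), dtype=object)

# groupby sorts its keys by default, so sort to reproduce the loop's order.
sorted_names = np.sort(group_names)

# Reference result: the original loop over the groupby object.
loop_names = np.array([name for name, _ in df.groupby('col1')], dtype=object)
```

Here sorted_names and loop_names hold the same values in the same order, while group_names keeps the order in which the values first occur in the column.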
groupby objects have a .groups attribute:
groups = df.groupby('col1').groups
This returns a dict that maps each group name to the index labels of the rows in that group.
Example:
In[257]:
df = pd.DataFrame({'a':list('aabcccc'), 'b':np.random.randn(7)})
groups = df.groupby('a').groups
groups
Out[257]:
{'a': Int64Index([0, 1], dtype='int64'),
'b': Int64Index([2], dtype='int64'),
'c': Int64Index([3, 4, 5, 6], dtype='int64')}
In[258]:
groups.keys()
Out[258]: dict_keys(['a', 'b', 'c'])
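To get from those keys to the numpy array the question asks for, something like the following should work. It is a sketch that reuses the example DataFrame above; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': list('aabcccc'), 'b': np.random.randn(7)})
grouped = df.groupby('a')

# Iterating over the .groups dict yields its keys, i.e. the group names.
group_names = np.array(list(grouped.groups), dtype=object)

# ngroups gives the number of groups without building the dict.
n_groups = grouped.ngroups
```

Note that .groups also builds the index labels for every group, so if only the names are needed, unique() on the column avoids that extra work.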