I'm classifying the rows in a DataFrame using a certain classifier based on the values in a particular column. My goal is to append the results to one new column or another depending on certain conditions. The code, as it stands looks something like this:
df = pd.DataFrame({'A': [list with classifier ids], # Only 3 ids, One word strings
'B': [List of text to be classified], # Millions of unique rows, lines of text around 5-25 words long
'C': [List of the old classes]} # Hundreds of possible classes, four digit integers stored as strings
df.sort_values('A', inplace=True)
new_col1, new_col2 = [], []
for name, group in df.groupby('A', sort=False):
classifier = classy_dict[name]
vectors = vectorize(group.B.values)
preds = classifier.predict(vectors)
scores = classifier.decision_function(vectors)
for tup in zip(preds, scores, group.C.values):
if tup[2] == tup[0]:
new_col1.append(np.nan)
new_col2.append(tup[2])
else:
new_col1.append(str(classifier.classes_[tup[1].argsort()[-5:]]))
new_col2.append(np.nan)
df['D'] = new_col1
df['E'] = new_col2
I am concerned that groupby
will not iterate in a top-down, order-of-appearance manner as I expect. Iteration order when sort=False
is not covered in the docs
All I'm looking for here is some affirmation that groupby('col', sort=False)
does iterate in the top-down order-of-appearance way that I expect. If there is a better way to make all of this work, suggestions are appreciated.
Here is the code I used to test my theory on sort=False
iteration order:
from numpy.random import randint
import pandas as pd
from string import ascii_lowercase as lowers
df = pd.DataFrame({'A': [lowers[randint(3)] for _ in range(100)],
'B': randint(10, size=100)})
print(df.A.unique()) # unique values in order of appearance per the docs
for name, group in df.groupby('A', sort=False):
print(name)
Edit: The above code makes it appear as though it acts in the manner that I expect, but I would like some more undeniable proof, if it is available.
Yes, when you pass sort=False
the order of first appearance is preserved. The groupby
source code is a little opaque, but there is one function groupby.ngroup
which fully answers this question, as it directly tells you the order in which iteration occurs.
def ngroup(self, ascending=True):
"""
Number each group from 0 to the number of groups - 1.
This is the enumerative complement of cumcount. Note that the
numbers given to the groups match the order in which the groups
would be seen when iterating over the groupby object, not the
order they are first observed.
""
Data from @coldspeed
df['sort=False'] = df.groupby('col', sort=False).ngroup()
df['sort=True'] = df.groupby('col', sort=True).ngroup()
col sort=False sort=True
0 16 0 7
1 1 1 0
2 10 2 5
3 20 3 8
4 3 4 2
5 13 5 6
6 2 6 1
7 5 7 3
8 7 8 4
When sort=False
you iterate based on the first appearance, when sort=True
it sorts the groups, and then iterates.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With