
Iteration order with pandas groupby on a pre-sorted DataFrame

The Situation

I'm classifying the rows of a DataFrame, with the classifier for each row chosen by the value in a particular column. My goal is to write each result to one of two new columns, depending on whether the prediction matches the row's old class. The code, as it stands, looks something like this:

import numpy as np
import pandas as pd

# The bracketed descriptions below are placeholders for the real data.
df = pd.DataFrame({'A': [list with classifier ids],  # Only 3 ids, one-word strings
                   'B': [List of text to be classified],  # Millions of unique rows, lines of text around 5-25 words long
                   'C': [List of the old classes]})  # Hundreds of possible classes, four-digit integers stored as strings

df.sort_values('A', inplace=True)

new_col1, new_col2 = [], []
for name, group in df.groupby('A', sort=False):
    classifier = classy_dict[name]
    vectors = vectorize(group.B.values)

    preds = classifier.predict(vectors)
    scores = classifier.decision_function(vectors)

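    # Compare each prediction with the old class of the same row.
    for pred, score, old_class in zip(preds, scores, group.C.values):
        if old_class == pred:
            # Prediction agrees with the old class: keep the class in E.
            new_col1.append(np.nan)
            new_col2.append(old_class)
        else:
            # Disagreement: record the five highest-scoring classes in D.
            new_col1.append(str(classifier.classes_[score.argsort()[-5:]]))
            new_col2.append(np.nan)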

df['D'] = new_col1
df['E'] = new_col2

The Issue

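I am concerned that groupby will not iterate over the groups in top-down order of first appearance, as I expect. Because the new columns are built as plain lists and assigned positionally, any other order would misalign the results with the rows. The iteration order when sort=False is not covered in the docs.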

My Expectations

All I'm looking for here is some affirmation that groupby('col', sort=False) does iterate in the top-down order-of-appearance way that I expect. If there is a better way to make all of this work, suggestions are appreciated.

Here is the code I used to test my theory on sort=False iteration order:

from numpy.random import randint
import pandas as pd
from string import ascii_lowercase as lowers

df = pd.DataFrame({'A': [lowers[randint(3)] for _ in range(100)],
                   'B': randint(10, size=100)})

print(df.A.unique())  # unique values in order of appearance per the docs

for name, group in df.groupby('A', sort=False):
    print(name)

Edit: The above code makes it appear as though it acts in the manner that I expect, but I would like some more undeniable proof, if it is available.

Eric Ed Lohmar asked Mar 07 '23 15:03


1 Answer

Yes, when you pass sort=False the order of first appearance is preserved. The groupby source code is a little opaque, but the docstring of one method, GroupBy.ngroup, answers this question, because it states that the group numbers match the order in which iteration occurs:

def ngroup(self, ascending=True):
    """
    Number each group from 0 to the number of groups - 1.
    This is the enumerative complement of cumcount.  Note that the
    numbers given to the groups match the order in which the groups
    would be seen when iterating over the groupby object, not the
    order they are first observed.
    """

Data from @coldspeed

df['sort=False'] = df.groupby('col', sort=False).ngroup()
df['sort=True'] = df.groupby('col', sort=True).ngroup()

Output:

    col  sort=False  sort=True
0   16           0          7
1    1           1          0
2   10           2          5
3   20           3          8
4    3           4          2
5   13           5          6
6    2           6          1
7    5           7          3
8    7           8          4
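
The table can be reproduced, and the link between ngroup numbers and iteration order checked directly, with a short self-contained sketch. The col values are copied from the output above; the names iter_unsorted and iter_sorted are only illustrative.

```python
import pandas as pd

# 'col' values copied from the output table above.
df = pd.DataFrame({'col': [16, 1, 10, 20, 3, 13, 2, 5, 7]})

df['sort=False'] = df.groupby('col', sort=False).ngroup()
df['sort=True'] = df.groupby('col', sort=True).ngroup()
print(df)

# Collect the group keys in the order that iteration yields them.
iter_unsorted = [name for name, _ in df.groupby('col', sort=False)]
iter_sorted = [name for name, _ in df.groupby('col', sort=True)]

# With sort=False the keys come out in order of first appearance,
# which is the same order that Series.unique() returns.
print(iter_unsorted == list(df['col'].unique()))

# With sort=True the keys come out in sorted order.
print(iter_sorted == sorted(df['col'].unique()))
```

This is an empirical check on one installed pandas version, not a proof, but it ties the ngroup numbering to what the loop actually yields.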

When sort=False, you iterate over the groups in order of first appearance. When sort=True, the group keys are sorted first, and iteration follows the sorted order.
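
On the request for a better way: one option is to avoid depending on iteration order at all, by building each group's results as Series that keep the group's index and letting pandas align them on assignment. The following is only a sketch under assumptions. fake_predict is a hypothetical stand-in for classy_dict[name].predict(vectorize(...)), the data is made up, and column D holds the single mismatching prediction rather than the five highest-scoring classes. It also assumes the DataFrame index is unique.

```python
import numpy as np
import pandas as pd

# Small made-up frame, deliberately NOT sorted by 'A'.
df = pd.DataFrame({
    'A': ['x', 'y', 'x', 'z', 'y'],
    'B': ['one', 'two', 'three', 'four', 'five'],
    'C': ['1000', '2000', '3000', '1000', '2000'],
})

def fake_predict(name, texts):
    """Hypothetical stand-in for the real classifier: always predicts '1000'."""
    return np.array(['1000'] * len(texts))

pieces_d, pieces_e = [], []
for name, group in df.groupby('A', sort=False):
    preds = fake_predict(name, group['B'].to_numpy())
    match = group['C'] == preds  # boolean Series carrying the group's index

    # Keep the old class where the prediction agrees, NaN elsewhere.
    pieces_e.append(group['C'].where(match))
    # Keep the prediction where it disagrees, NaN elsewhere.
    pieces_d.append(pd.Series(preds, index=group.index).where(~match))

# Assigning a Series aligns on the index, so group order no longer matters.
df['D'] = pd.concat(pieces_d)
df['E'] = pd.concat(pieces_e)
print(df)
```

Because each piece carries its original row labels, the sort_values call and the sort=False argument become optional rather than something the correctness of the result depends on.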

ALollz answered Mar 16 '23 00:03