
Pandas function: DataFrame.apply() runs top row twice [duplicate]

Tags: python, pandas

I have two versions of a function that uses Pandas for Python 2.7 to go through inputs.csv, row by row.

The first version uses Series.apply() on a single column, and goes through each row as intended.

The second version uses DataFrame.apply() on multiple columns, and for some reason it reads the top row twice. It then goes on to execute the rest of the rows without duplicates.

Any ideas why the latter reads the top row twice?


Version #1 – Series.apply() (Reads top row once)

import pandas as pd
df = pd.read_csv("inputs.csv", delimiter=",")

def v1(x):
    # x is a single value from column "X"
    y = x
    return pd.Series(y)
df["Y"] = df["X"].apply(v1)

Version #2 – DataFrame.apply() (Reads top row twice)

import pandas as pd
df = pd.read_csv("inputs.csv", delimiter=",")

def v2(f):
    # f is one row of the frame (a Series holding "X" and "Z")
    y = f["X"]
    return pd.Series(y)
df["Y"] = df[["X", "Z"]].apply(v2, axis=1)

print y:

v1(x):            v2(f):

    Row_1         Row_1
    Row_2         Row_1
    Row_3         Row_2
                  Row_3
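
The behaviour can be observed without inputs.csv by recording which rows the function receives. This is only a minimal sketch: the three-row frame and the `seen` list are made up for illustration, and whether the first label shows up twice depends on the installed pandas version (as far as I know, releases from 1.1 onward evaluate the first row only once).

```python
import pandas as pd

# Hypothetical stand-in for inputs.csv
df = pd.DataFrame(
    {"X": [10, 20, 30], "Z": [1, 2, 3]},
    index=["Row_1", "Row_2", "Row_3"],
)

seen = []  # labels of the rows passed to v2, in call order

def v2(f):
    seen.append(f.name)  # f.name is the index label of the current row
    return f["X"]

df["Y"] = df[["X", "Z"]].apply(v2, axis=1)

# On an affected version the list starts with "Row_1" twice
print(seen)
```

The resulting "Y" column is the same either way; only the number of calls differs.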
P A N asked Aug 07 '15 12:08



2 Answers

This is by design, as described here and here

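The apply function needs to know the shape of the returned data to figure out how the results will be combined, so it calls the function on the first row once to inspect the result and then again as part of the normal pass over all rows. Apply is a shortcut that works out for itself whether to aggregate, transform or filter. If the extra call is a problem, for example because the function has side effects, you can try breaking your function apart to avoid the duplicate calls.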
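
The answer's original snippet did not survive, so the following is only a sketch of one way to split a function, not the answerer's code. It assumes the duplicate call matters because the function has a side effect (here a hypothetical `log` list). The pure computation stays in apply, while the side effect moves to an explicit loop that visits each row exactly once. The frame and the names `compute` and `record` are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for inputs.csv
df = pd.DataFrame({"X": [10, 20, 30], "Z": [1, 2, 3]})

log = []  # hypothetical side effect we want exactly once per row

def compute(f):
    # Pure function: harmless if apply() calls it an extra time
    return f["X"] + f["Z"]

def record(row):
    # Side effect, kept out of apply()
    log.append(row.X)

df["Y"] = df[["X", "Z"]].apply(compute, axis=1)

# itertuples() yields every row exactly once, in order
for row in df.itertuples(index=False):
    record(row)
```

Because `compute` has no side effects, it does not matter how many times apply calls it on the first row.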

AZhao answered Oct 19 '22 22:10



I honestly don't see any explanation of this in the provided links, but anyway: I ran into the same thing in my code and did the silliest thing possible, i.e. short-circuited the first call. But it worked.

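is_first_call = True

def refill_uniform(row, st=600):
    # The flag lives at module level, so it is declared global
    # (nonlocal would only work inside a nested function)
    global is_first_call
    if is_first_call:
        # Skip the extra first call that apply() makes
        is_first_call = False
        return row

    # ... here goes the code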
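
Skipping the first call unconditionally only works if pandas actually makes the extra call; on a version that evaluates the first row once, the guard would skip the real first row. A sketch of a guard keyed on the row label instead, assuming the index labels are unique (`once_per_row`, `refill` and the sample frame are hypothetical names for illustration, not pandas API):

```python
import pandas as pd

def once_per_row(func):
    """Wrap func so repeated calls for the same row label reuse the first result."""
    cache = {}

    def wrapper(row, *args, **kwargs):
        # row.name is the index label of the row passed in by apply(axis=1)
        if row.name not in cache:
            cache[row.name] = func(row, *args, **kwargs)
        return cache[row.name]

    return wrapper

calls = []  # records which rows the wrapped function really ran on

@once_per_row
def refill(row):
    calls.append(row.name)
    return row["X"] * 2

df = pd.DataFrame({"X": [1, 2, 3], "Z": [4, 5, 6]})
result = df.apply(refill, axis=1)
```

The wrapped function body runs once per label whether or not apply repeats the first row, at the cost of keeping every result in memory.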

Oleg O answered Oct 19 '22 20:10
