I have two versions of a function that uses Pandas for Python 2.7 to go through inputs.csv, row by row.
The first version uses Series.apply() on a single column, and goes through each row as intended.
The second version uses DataFrame.apply() on multiple columns, and for some reason it reads the top row twice. It then goes on to execute the rest of the rows without duplicates.
Any ideas why the latter reads the top row twice?
Version #1 – Series.apply()
(Reads top row once)
import pandas as pd
df = pd.read_csv("inputs.csv", delimiter=",")
def v1(x):
    # x is a single value from column "X"
    y = x
    return pd.Series(y)
df["Y"] = df["X"].apply(v1)
Version #2 – DataFrame.apply()
(Reads top row twice)
import pandas as pd
df = pd.read_csv("inputs.csv", delimiter=",")
def v2(f):
    # f is one row (a Series) holding the "X" and "Z" values
    y = f["X"]
    return pd.Series(y)
df["Y"] = df[["X", "Z"]].apply(v2, axis=1)
print y:
v1(x):    v2(f):
Row_1     Row_1
Row_2     Row_1
Row_3     Row_2
          Row_3
This is by design, as described here and here
The apply function needs to know the shape of the returned data to intelligently figure out how it will be combined. Apply is a shortcut that intelligently applies aggregate, transform or filter. You can try breaking apart your function like so to avoid the duplicate calls.
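The code this answer refers to is not shown, so the following is only a minimal sketch of one way to get exactly one call per row: iterate over the columns explicitly instead of going through DataFrame.apply(). The sample data, the `seen` list and the `v2_row` function are assumptions made for illustration, not part of the original question or answer.

```python
import pandas as pd

# Stand-in data; the question reads inputs.csv, whose contents are not shown.
df = pd.DataFrame({"X": ["Row_1", "Row_2", "Row_3"], "Z": [10, 20, 30]})

seen = []  # records every value the function is called with


def v2_row(x, z):
    # Hypothetical per-row function that takes plain values
    # instead of a row Series.
    seen.append(x)
    return x


# zip walks both columns in lockstep, so v2_row runs exactly once per row,
# regardless of how DataFrame.apply() behaves internally.
df["Y"] = [v2_row(x, z) for x, z in zip(df["X"], df["Z"])]
```

Because pandas is not asked to infer the shape of the result here, there is no reason for an extra probing call, which matters when the function has side effects such as network requests.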
I honestly don't see any explanation of this in the provided links, but anyway: I ran into the same thing in my code and did the silliest thing, i.e. short-circuited the first call. But it worked.
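is_first_call = True
def refill_uniform(row, st=600):
    # nonlocal requires is_first_call to live in an enclosing function;
    # if it is a module-level variable, use `global is_first_call` instead.
    nonlocal is_first_call
    if is_first_call:
        # Skip the extra first call that apply() makes
        is_first_call = False
        return row
    # ... here goes the code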
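The flag above only works if the extra first call really happens, so it would silently skip the first row on a pandas version that calls the function once per row. A possible alternative, sketched here under the assumption that the row labels are unique, is to cache results by row label so the function body runs once per row either way. The `once_per_row` helper, the `slow` function and the sample data are hypothetical names made up for this example.

```python
import pandas as pd


def once_per_row(func):
    """Hypothetical helper: run func at most once per row label."""
    cache = {}

    def wrapper(row):
        key = row.name  # with axis=1, row.name is the row's index label
        if key not in cache:
            cache[key] = func(row)
        return cache[key]

    return wrapper


calls = []  # records which rows the real work was done for


def slow(row):
    # Stand-in for a function with side effects (e.g. a request).
    calls.append(row.name)
    return row["X"] * 2


df = pd.DataFrame({"X": [1, 2, 3], "Z": [4, 5, 6]})
df["Y"] = df[["X", "Z"]].apply(once_per_row(slow), axis=1)
```

If apply() does invoke the wrapper twice for the first row, the second invocation returns the cached value instead of repeating the work.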