python pandas: passing in dataframe to df.apply

Long time user of this site but first time asking a question! Thanks to all of the benevolent users who have been answering questions for ages :)

I have been using df.apply lately and ideally want to pass a dataframe into the args parameter to look something like so: df.apply(testFunc, args=(dfOther), axis = 1)

My ultimate goal is to iterate over the dataframe passed in through the args parameter, check some logic against each row of the original dataframe (say, df), and return a value from dfOther. So say I have a function like this:

def testFunc(row, dfOther):
    for index, rowOther in dfOther.iterrows():
        if row['A'] == rowOther[0] and row['B'] == rowOther[1]:
            return dfOther.at[index, 'C']

df['OTHER'] = df.apply(testFunc, args=(dfOther), axis = 1)

My current understanding is that args expects a Series object, so when I actually run this I get the following error:

ValueError: The truth value of a DataFrame is ambiguous. 
Use a.empty, a.bool(), a.item(), a.any() or a.all().

However, before I wrote testFunc, which only takes a single dataframe, I had actually written priorTestFunc, which looks like this... and it works!

def priorTestFunc(row, dfOne, dfTwo):
    for index, rowOne in dfOne.iterrows():
        if row['A'] == rowOne[0] and row['B'] == rowOne[1]:
            return dfTwo.at[index, 'C']

df['OTHER'] = df.apply(priorTestFunc, args=(dfOne, dfTwo), axis = 1)

So, to my dismay, I have been getting into the habit of writing testFunc like so, and it has been working as intended:

def testFunc(row, dfOther, _):
    for index, rowOther in dfOther.iterrows():
        if row['A'] == rowOther[0] and row['B'] == rowOther[1]:
            return dfOther.at[index, 'C']

df['OTHER'] = df.apply(testFunc, args=(dfOther, _), axis = 1)

I would really appreciate it if someone could let me know why this is the case, what errors I might be prone to, or whether there is a better alternative for solving this kind of problem!

EDIT: As requested in the comments, my dataframes generally look like the ones below. They have two matching columns, and I will be returning a value from dfOther.at[index, column]. I have considered pd.concat([dfOther, df]); however, I will be running an algorithm that tests conditions on df and then updates it with specific values from dfOther (which will also be updated), and I would like df to stay relatively neat rather than building a MultiIndex and throwing just about everything into it. I am also aware that df.iterrows is slow in general, but these dataframes will have about 500 rows at most, so scalability isn't really a concern for me at the moment.

df
Out[10]: 
    A    B      C
0  foo  bur   6000
1  foo  bur   7000
2  foo  bur   8000
3  bar  kek   9000
4  bar  kek  10000
5  bar  kek  11000

dfOther
Out[12]: 
    A    B      C
0  foo  bur   1000
1  foo  bur   2000
2  foo  bur   3000
3  bar  kek   4000
4  bar  kek   5000
5  bar  kek   6000
asked Jun 04 '16 12:06 by jboxxx


1 Answer

The error is in this line:

  File "C:\Anaconda3\envs\p2\lib\site-packages\pandas\core\frame.py", line 4017, in apply
    if kwds or args and not isinstance(func, np.ufunc):

Here, if kwds or args tests the truthiness of args (whenever kwds is empty), which for a tuple means checking whether it contains at least one element. This is a common way to check whether a container is empty:

l = []

if l:
    print("l is not empty!")
else:
    print("l is empty!")

l is empty!

l = [1]

if l:
    print("l is not empty!")
else:
    print("l is empty!")

l is not empty!
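
The same truthiness test is what trips up a DataFrame. Here is a minimal sketch of the mechanism in plain Python, with no pandas required. The class Ambiguous and the function has_extra_args are hypothetical stand-ins invented for illustration (they are not pandas code); the function is only a simplified version of the condition quoted above.

```python
class Ambiguous:
    """Hypothetical stand-in for a DataFrame: refuses to be used as a boolean."""

    def __bool__(self):
        raise ValueError("The truth value of Ambiguous is ambiguous.")


def has_extra_args(kwds, args):
    # Simplified form of the quoted check: Python tests bool(kwds) first
    # and, if that is falsy, goes on to test bool(args).
    if kwds or args:
        return True
    return False


obj = Ambiguous()

print(has_extra_args({}, ()))      # False: an empty tuple is falsy
print(has_extra_args({}, (obj,)))  # True: only the tuple is tested, never obj

try:
    has_extra_args({}, (obj))      # no comma, so args is obj itself
    raised = False
except ValueError:
    raised = True
print(raised)                      # True: bool(obj) was called and raised
```

The point of the sketch is that the object inside a tuple is never asked for its truth value; only the tuple is.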

If you had passed a tuple to df.apply as args, the tuple would be truthy (it is non-empty) and there wouldn't be a problem. However, Python does not interpret (df) as a tuple:

type((df))
Out[39]: pandas.core.frame.DataFrame

It is just a DataFrame/variable inside parentheses. When you type if df:

if df:
    print("df is not empty")

Traceback (most recent call last):

  File "<ipython-input-40-c86da5a5f1ee>", line 1, in <module>
    if df:

  File "C:\Anaconda3\envs\p2\lib\site-packages\pandas\core\generic.py", line 887, in __nonzero__
    .format(self.__class__.__name__))

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

You get the same error message. However, if you use a comma to indicate that it's a tuple, it works fine:

if (df, ):
    print("tuple is not empty")

tuple is not empty

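This is also why args=(dfOne, dfTwo) and args=(dfOther, _) worked: the comma makes each of them a tuple. As a result, adding a trailing comma inside the parentheses of args=(dfOther), so that it becomes a one-element tuple, should solve the problem: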

df['OTHER'] = df.apply(testFunc, args=(dfOther, ), axis = 1)
answered Oct 19 '22 16:10 by ayhan