
How do you pass multiple variables to a pandas DataFrame to use them with .map to create a new column?

Tags:

python

pandas

To pass multiple variables to a normal Python function you can just write something like:

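from datetime import timedelta

def a_function(date, string, number):
    # Convert the string to an int, then move the date forward
    # by (number * count) days
    count = int(string)
    return date + timedelta(days=number * count)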

When using Pandas DataFrames I know you can create a new column based on the contents of one like so:

df['new_col'] = df['column_A'].map(a_function)
# This might return the year from a date column
# return date.year

What I'm wondering is in the same way you can pass multiple pieces of data to a single function (as seen in the first example above), can you use multiple columns in the creation of a new pandas DataFrame column?

For example combining three separate parts of a date Y - M - D into one field.

df['whole_date'] = df['Year','Month','Day'].map(a_function)

I get a KeyError with the following test:

def combine(one,two,three):
    return one + two + three

df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4],'c': [4,5,6]})

df['d'] = df['a','b','c'].map(combine)  # raises KeyError

Is there a way of creating a new column in a pandas DataFrame using .map or something else which takes as input three columns and returns a single column?

-> Example input: 1, 2, 3

-> Example output: 1*2*3

Likewise is there also a way of having a function take in one argument, a date and return three new pandas DataFrame columns; one for the year, month and day?

asked May 22 '15 05:05 by yoshiserry


1 Answer

Is there a way of creating a new column in a pandas DataFrame using .map or something else which takes as input three columns and returns a single column? For example, input would be 1, 2, 3 and output would be 1*2*3.

To do that, you can use apply with axis=1. However, instead of being called with three separate arguments (one for each column) your specified function will then be called with a single argument for each row, and that argument will be a Series containing the data for that row. You can either account for this in your function:

def combine(row):
    return row['a'] + row['b'] + row['c']

>>> df.apply(combine, axis=1)
0     7
1    10
2    13

Or you can pass a lambda which unpacks the Series into separate arguments:

def combine(one,two,three):
    return one + two + three

>>> df.apply(lambda x: combine(*x), axis=1)
0     7
1    10
2    13

If you want to pass only specific columns, you need to select them by indexing on the DataFrame with a list:

>>> df[['a', 'b', 'c']].apply(lambda x: combine(*x), axis=1)
0     7
1    10
2    13

Note the double brackets. (This doesn't really have anything to do with apply; indexing with a list is the normal way to access multiple columns from a DataFrame.)
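To get an actual new column, the result of apply can be assigned back to the DataFrame. The following is a minimal sketch rather than part of the original answer: combine is redefined here as a hypothetical product, to match the 1*2*3 output asked for in the question, and the new column name 'd' is taken from the question's test.

```python
import pandas as pd

def combine(one, two, three):
    # Hypothetical combiner: multiplies the three values (1, 2, 3 -> 6)
    return one * two * three

df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': [4, 5, 6]})

# Select the columns in the order the function expects its arguments,
# then unpack each row into separate arguments
df['d'] = df[['a', 'b', 'c']].apply(lambda row: combine(*row), axis=1)
print(df['d'].tolist())
```

Because the row is unpacked positionally, the order of the names in the selection list determines which column lands in which argument.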

However, it's important to note that in many cases you don't need to use apply, because you can just use vectorized operations on the columns themselves. The combine function above can simply be called with the DataFrame columns themselves as the arguments:

>>> combine(df.a, df.b, df.c)
0     7
1    10
2    13

This is typically much more efficient when the "combining" operation is vectorizable.
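As a rough illustration of the difference, here is a sketch that computes the product from the question (1*2*3) in three ways and checks that they agree. The variable names are made up for this example, and DataFrame.prod is a swapped-in alternative that the answer does not mention.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': [4, 5, 6]})

# Vectorized: multiply whole columns at once
vectorized = df['a'] * df['b'] * df['c']

# Built-in reduction across the selected columns of each row
by_prod = df[['a', 'b', 'c']].prod(axis=1)

# Row-by-row version with apply, shown for comparison only
by_apply = df.apply(lambda row: row['a'] * row['b'] * row['c'], axis=1)

print(vectorized.tolist())
print(by_prod.tolist() == vectorized.tolist())
print(by_apply.tolist() == vectorized.tolist())
```

All three produce the same values; the first two avoid calling a Python function once per row.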

Likewise is there also a way of having a function take in one argument, a date and return three new pandas dataframe columns; one for the year, month and day?

As above, there are two basic ways to do this: a general but non-vectorized way using apply, and a faster vectorized way. Suppose you have a DataFrame like this:

>>> df = pandas.DataFrame({'date': pandas.date_range('2015/05/01', '2015/05/03')})
>>> df
        date
0 2015-05-01
1 2015-05-02
2 2015-05-03

You can define a function that returns a Series for each value, and then apply it to the column:

def dateComponents(date):
    return pandas.Series([date.year, date.month, date.day], index=["Year", "Month", "Day"])

>>> df.date.apply(dateComponents)
   Year  Month  Day
0  2015      5    1
1  2015      5    2
2  2015      5    3
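The result above is a separate DataFrame. One way to attach it to the original as new columns is DataFrame.join, sketched below. The snake_case name date_components and the variable result are choices made for this example, not part of the answer.

```python
import pandas as pd

def date_components(date):
    # Return one labeled value per new column
    return pd.Series([date.year, date.month, date.day],
                     index=["Year", "Month", "Day"])

df = pd.DataFrame({'date': pd.date_range('2015-05-01', '2015-05-03')})

# apply returns a DataFrame here; join aligns it with df on the index
result = df.join(df['date'].apply(date_components))
print(result)
```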

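The apply approach works for any per-value function. For a datetime column like this one, pandas versions that provide the .dt accessor also offer vectorized access to the individual components (for example df.date.dt.year). Other kinds of data have vectorized accessors too, such as .str for strings: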

>>> df = pandas.DataFrame({'a': ["Hello", "There", "Pal"]})
>>> df
        a
0  Hello
1  There
2    Pal

>>> pandas.DataFrame({'FirstChar': df.a.str[0], 'Length': df.a.str.len()})
   FirstChar  Length
0         H       5
1         T       5
2         P       3

Here again the operation is vectorized by operating directly on the values instead of applying a function elementwise. In this case, we have two vectorized operations (getting first character and getting the string length), and then we wrap the results in another call to DataFrame to create separate columns for each of the two kinds of results.
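A similar vectorized sketch for the date case is shown below. It assumes a pandas version that provides the .dt accessor on datetime columns and accepts a DataFrame with year, month and day columns in pandas.to_datetime; neither call appears in the answer above, and the names parts and whole_date are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2015-05-01', '2015-05-03')})

# Split: the .dt accessor exposes date components for the whole column
parts = pd.DataFrame({
    'year': df['date'].dt.year,
    'month': df['date'].dt.month,
    'day': df['date'].dt.day,
})

# Combine: to_datetime assembles dates from year/month/day columns
whole_date = pd.to_datetime(parts[['year', 'month', 'day']])

# Check that the round trip reproduces the original dates
print((whole_date == df['date']).all())
```

This covers both directions from the question: one date column into three columns, and three columns back into one date.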

answered Nov 14 '22 23:11 by BrenBarn