I have a DataFrame df with a column containing labels for each row (in addition to some relevant data for each row). I have a dictionary labeldict with keys equal to the possible labels and values equal to 2-tuples of information related to that label. I'd like to tack two new columns onto my frame, one for each part of the 2-tuple corresponding to the label for each row.
Here is the setup:
    import pandas as pd
    import numpy as np

    np.random.seed(1)
    n = 10
    labels = list('abcdef')
    colors = ['red', 'green', 'blue']
    sizes = ['small', 'medium', 'large']
    labeldict = {c: (np.random.choice(colors), np.random.choice(sizes)) for c in labels}
    df = pd.DataFrame({'label': np.random.choice(labels, n), 'somedata': np.random.randn(n)})
I can get what I want by running:
    df['color'], df['size'] = zip(*df['label'].map(labeldict))
    print(df)

      label  somedata  color    size
    0     b  0.196643    red  medium
    1     c -1.545214  green   small
    2     a -0.088104  green   small
    3     c  0.852239  green   small
    4     b  0.677234    red  medium
    5     c -0.106878  green   small
    6     a  0.725274  green   small
    7     d  0.934889    red  medium
    8     a  1.118297  green   small
    9     c  0.055613  green   small
But how can I do this if I don't want to manually type out the two columns on the left side of the assignment? That is, how can I create multiple new columns on the fly? For example, if I had 10-tuples in labeldict instead of 2-tuples, this would be a real pain as currently written. Here are a couple of things that don't work:
    # set up attrlist for later use
    attrlist = ['color', 'size']

    # non-working idea 1)
    df[attrlist] = zip(*df['label'].map(labeldict))

    # non-working idea 2)
    df.loc[:, attrlist] = zip(*df['label'].map(labeldict))
This does work, but seems like a hack:
    for a in attrlist:
        df[a] = 0
    df[attrlist] = zip(*df['label'].map(labeldict))
Better solutions?
If you want to add multiple columns to a DataFrame as part of a method chain, you can use apply. The first step is to create a function that will transform a row represented as a Series into the form you want. Then you can call apply to use this function on each row.
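A minimal sketch of that approach, using a small frame shaped like the one in the question. The fixed labeldict, the helper name label_attrs, and the use of join to attach the result are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Hypothetical fixed data, shaped like the question's setup
labeldict = {'a': ('green', 'small'), 'b': ('red', 'medium'), 'c': ('green', 'small')}
attrlist = ['color', 'size']
df = pd.DataFrame({'label': ['b', 'c', 'a', 'c'], 'somedata': [0.1, 0.2, 0.3, 0.4]})

def label_attrs(row):
    """Turn one row (a Series) into a Series holding the new column values."""
    return pd.Series(labeldict[row['label']], index=attrlist)

# apply(axis=1) calls label_attrs once per row; the returned Series become columns
new_cols = df.apply(label_attrs, axis=1)

# join attaches the new columns by index, so this step fits in a method chain
result = df.join(new_cols)
print(result)
```

Because the column names come from attrlist, moving from 2-tuples to 10-tuples would only require a longer attrlist. Row-wise apply is convenient but can be slow on large frames.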
    import numpy as np
    import pandas as pd

    df = {'col_1': [0, 1, 2, 3], 'col_2': [4, 5, 6, 7]}
    df = pd.DataFrame(df)
    df[['column_new_1', 'column_new_2', 'column_new_3']] = [np.nan, 'dogs', 3]  # thought this would work here...
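Whether the direct list assignment above is accepted may depend on the pandas version. One alternative that avoids the question is DataFrame.assign, sketched below; the new_values dictionary is a hypothetical name introduced for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_1': [0, 1, 2, 3], 'col_2': [4, 5, 6, 7]})

# Hypothetical mapping of new column names to the constant each should hold
new_values = {'column_new_1': np.nan, 'column_new_2': 'dogs', 'column_new_3': 3}

# assign returns a new frame; each scalar is broadcast down its column
df = df.assign(**new_values)
print(df)
```

Since assign returns a new DataFrame instead of modifying in place, it also works inside a method chain.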
Return multiple columns from pandas apply(): the function you pass to apply() can return a Series that contains the new data. Pass axis=1 to apply() so that the function (multiply, in this case) is applied to each row of the DataFrame; the Series returned for each row become multiple columns in the result.
Just use result_type='expand' in pandas apply:
    df
    Out[78]:
       a  b
    0  0  1
    1  2  3
    2  4  5
    3  6  7
    4  8  9

    df[['mean', 'std', 'max']] = df[['a','b']].apply(mathOperationsTuple, axis=1, result_type='expand')

    df
    Out[80]:
       a  b  mean  std  max
    0  0  1   0.5  0.5  1.0
    1  2  3   2.5  0.5  3.0
    2  4  5   4.5  0.5  5.0
    3  6  7   6.5  0.5  7.0
    4  8  9   8.5  0.5  9.0
And here is some copy-paste code:
    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.arange(10).reshape(5,2), columns=['a','b'])
    print('df', df, sep='\n')
    print()

    def mathOperationsTuple(arr):
        return np.mean(arr), np.std(arr), np.amax(arr)

    df[['mean', 'std', 'max']] = df[['a','b']].apply(mathOperationsTuple, axis=1, result_type='expand')
    print('df', df, sep='\n')
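The same result_type='expand' idea can be pointed at the label lookup from the question. This is only a sketch: the fixed labeldict and sample rows are made up for illustration, and the result is renamed and joined explicitly so the code does not rely on assigning to columns that do not exist yet.

```python
import pandas as pd

# Hypothetical fixed data, shaped like the question's setup
labeldict = {'a': ('green', 'small'), 'b': ('red', 'medium'), 'c': ('green', 'small')}
attrlist = ['color', 'size']
df = pd.DataFrame({'label': ['b', 'c', 'a', 'c'], 'somedata': [0.1, 0.2, 0.3, 0.4]})

# Each row's label is looked up in labeldict; result_type='expand' spreads the
# returned tuple across columns numbered 0, 1, ...
expanded = df.apply(lambda row: labeldict[row['label']], axis=1, result_type='expand')

# Give the numbered columns their real names, then attach them by index
expanded.columns = attrlist
df = df.join(expanded)
print(df)
```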
You can use merge instead:
    >>> ld = pd.DataFrame(labeldict).T
    >>> ld.columns = ['color', 'size']
    >>> ld.index.name = 'label'
    >>> df.merge(ld.reset_index(), on='label')
      label  somedata  color    size
    0     b  1.462108    red  medium
    1     c -2.060141  green   small
    2     c  1.133769  green   small
    3     c  0.042214  green   small
    4     e -0.322417    red  medium
    5     e -1.099891    red  medium
    6     e -0.877858    red  medium
    7     e  0.582815    red  medium
    8     f -0.384054    red   large
    9     d -0.172428    red  medium
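A closely related variant, sketched here under the assumption that every label in df appears in labeldict, builds the lookup table with DataFrame.from_dict and attaches it with join instead of merge. The fixed labeldict and sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical fixed data, shaped like the question's setup
labeldict = {'a': ('green', 'small'), 'b': ('red', 'medium'), 'c': ('green', 'small')}
attrlist = ['color', 'size']
df = pd.DataFrame({'label': ['b', 'c', 'a', 'c'], 'somedata': [0.1, 0.2, 0.3, 0.4]})

# One row per label, one column per tuple element
lookup = pd.DataFrame.from_dict(labeldict, orient='index', columns=attrlist)

# join matches df['label'] against lookup's index; the default left join
# keeps df's rows, order, and index unchanged
result = df.join(lookup, on='label')
print(result)
```

With 10-tuples in labeldict, only attrlist would need to grow; the join line stays the same.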