
Add new column to dataframe based on dictionary

I have a dataframe and a dictionary. I need to add a new column to the dataframe and calculate its values based on the dictionary.

The context is machine learning: I want to add a new feature whose values are looked up from a table:

    import numpy as np
    import pandas as pd

    score = {(1, 45, 1, 1): 4, (0, 1, 2, 1): 5}

    df = pd.DataFrame(data={
        'gender':      [1,  1,  0, 1,  1,  0,  0,  0,  1,  0],
        'age':         [13, 45, 1, 45, 15, 16, 16, 16, 15, 15],
        'cholesterol': [1,  2,  2, 1,  1,  1,  1,  1,  1,  1],
        'smoke':       [0,  0,  1, 1,  7,  8,  3,  4,  4,  2]},
        dtype=np.int64)

    print(df, '\n')
    df['score'] = 0
    # Attempted lookup: raises a TypeError because Series objects are not hashable
    df.score = score[(df.gender, df.age, df.cholesterol, df.smoke)]
    print(df)

I expect the following output:

       gender  age  cholesterol  smoke  score
    0       1   13            1      0      0
    1       1   45            2      0      0
    2       0    1            2      1      5
    3       1   45            1      1      4
    4       1   15            1      7      0
    5       0   16            1      8      0
    6       0   16            1      3      0
    7       0   16            1      4      0
    8       1   15            1      4      0
    9       0   15            1      2      0
Roman Kazmin asked Oct 29 '19 16:10



2 Answers

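Since score is a dictionary (so its keys are unique), we can use MultiIndex alignment: set the four key columns as the index, then assign a Series built from the dictionary so that values are matched on the tuple keys.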

    df = df.set_index(['gender', 'age', 'cholesterol', 'smoke'])
    df['score'] = pd.Series(score)  # Assign values based on the tuple
    df = df.fillna(0, downcast='infer').reset_index()  # Back to columns

       gender  age  cholesterol  smoke  score
    0       1   13            1      0      0
    1       1   45            2      0      0
    2       0    1            2      1      5
    3       1   45            1      1      4
    4       1   15            1      7      0
    5       0   16            1      8      0
    6       0   16            1      3      0
    7       0   16            1      4      0
    8       1   15            1      4      0
    9       0   15            1      2      0
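As far as I know, recent pandas 2.x releases deprecate the `downcast` argument of `fillna`, so the same alignment idea can also be sketched with an explicit integer cast. This is a minimal, self-contained variant rather than the answer's exact code; the names `keys` and `out` are illustrative.

```python
import numpy as np
import pandas as pd

score = {(1, 45, 1, 1): 4, (0, 1, 2, 1): 5}
df = pd.DataFrame({
    'gender':      [1, 1, 0, 1, 1, 0, 0, 0, 1, 0],
    'age':         [13, 45, 1, 45, 15, 16, 16, 16, 15, 15],
    'cholesterol': [1, 2, 2, 1, 1, 1, 1, 1, 1, 1],
    'smoke':       [0, 0, 1, 1, 7, 8, 3, 4, 4, 2],
}, dtype=np.int64)

# Illustrative name: the columns that together form the dictionary key
keys = ['gender', 'age', 'cholesterol', 'smoke']

out = df.set_index(keys)
# pd.Series(score) gets a MultiIndex from the tuple keys, so assignment
# aligns on (gender, age, cholesterol, smoke); unmatched rows become NaN
out['score'] = pd.Series(score)
# Replace NaN with 0 and cast back to integers explicitly
out['score'] = out['score'].fillna(0).astype(int)
out = out.reset_index()
print(out)
```

Only the new `score` column is filled and cast here, so missing values in other columns, if there were any, would be left untouched.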
ALollz answered Nov 09 '22 03:11


Use assign with a list comprehension: build a tuple from each row's values and look it up in the score dictionary, defaulting to zero if it is not found.

    >>> df.assign(score=[score.get(tuple(row), 0) for row in df.values])
       gender  age  cholesterol  smoke  score
    0       1   13            1      0      0
    1       1   45            2      0      0
    2       0    1            2      1      5
    3       1   45            1      1      4
    4       1   15            1      7      0
    5       0   16            1      8      0
    6       0   16            1      3      0
    7       0   16            1      4      0
    8       1   15            1      4      0
    9       0   15            1      2      0
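Note that `tuple(row)` over `df.values` uses every column, so the lookup only matches when the DataFrame holds exactly the key columns. A minimal sketch that restricts the lookup to the key columns, assuming a hypothetical extra column `patient_id` and an illustrative list `key_cols` (neither is in the original question):

```python
import pandas as pd

score = {(1, 45, 1, 1): 4, (0, 1, 2, 1): 5}
df = pd.DataFrame({
    'patient_id':  list(range(100, 110)),  # hypothetical extra column
    'gender':      [1, 1, 0, 1, 1, 0, 0, 0, 1, 0],
    'age':         [13, 45, 1, 45, 15, 16, 16, 16, 15, 15],
    'cholesterol': [1, 2, 2, 1, 1, 1, 1, 1, 1, 1],
    'smoke':       [0, 0, 1, 1, 7, 8, 3, 4, 4, 2],
})

# Illustrative name: only these columns take part in the dictionary key
key_cols = ['gender', 'age', 'cholesterol', 'smoke']

# itertuples(index=False, name=None) yields one plain tuple per row
row_keys = df[key_cols].itertuples(index=False, name=None)
out = df.assign(score=[score.get(key, 0) for key in row_keys])
print(out)
```

The order of `key_cols` has to match the order of the elements in the dictionary's tuple keys.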

Timings

Given the variety of approaches, I thought it would be interesting to compare some of the timings.

    # Initial dataframe 100k rows (10 rows of identical data replicated 10k times).
    df = pd.DataFrame(data = {
        'gender' :      [1,  1,  0, 1,  1,  0,  0,  0,  1,  0] * 10000,
        'age' :         [13, 45, 1, 45, 15, 16, 16, 16, 15, 15] * 10000,
        'cholesterol' : [1,  2,  2, 1, 1, 1, 1, 1, 1, 1] * 10000,
        'smoke' :       [0,  0,  1, 1, 7, 8, 3, 4, 4, 2] * 10000},
        dtype = np.int64)

    %timeit -n 10 df.assign(score=[score.get(tuple(v), 0) for v in df.values])
    # 223 ms ± 9.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    %%timeit -n 10
    df.assign(score=[score.get(t, 0) for t in zip(*map(df.get, df))])
    # 76.8 ms ± 2.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    %%timeit -n 10
    df.assign(score=[score.get(v, 0) for v in df.itertuples(index=False)])
    # 113 ms ± 2.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    %timeit -n 10 df.assign(score=df.apply(lambda x: score.get(tuple(x), 0), axis=1))
    # 1.84 s ± 77.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    %%timeit -n 10
    (df
     .set_index(['gender', 'age', 'cholesterol', 'smoke'])
     .assign(score=pd.Series(score))
     .fillna(0, downcast='infer')
     .reset_index()
    )
    # 138 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    %%timeit -n 10
    s=pd.Series(score)
    s.index.names=['gender','age','cholesterol','smoke']
    df.merge(s.to_frame('score').reset_index(),how='left').fillna(0).astype(int)
    # 24 ms ± 2.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    %%timeit -n 10
    df.assign(score=pd.Series(zip(df.gender, df.age, df.cholesterol, df.smoke))
                    .map(score)
                    .fillna(0)
                    .astype(int))
    # 191 ms ± 7.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    %%timeit -n 10
    df.assign(score=df[['gender', 'age', 'cholesterol', 'smoke']]
                    .apply(tuple, axis=1)
                    .map(score)
                    .fillna(0))
    # 1.95 s ± 134 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
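The `%timeit` magics above only work in IPython or Jupyter. For a plain Python script, a comparison of two of the approaches could be sketched with the standard `timeit` module. The function names `via_zip` and `via_merge` are illustrative, the repetition counts are smaller than in the original runs to keep the script quick, and no timings are shown because they depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

score = {(1, 45, 1, 1): 4, (0, 1, 2, 1): 5}
keys = ['gender', 'age', 'cholesterol', 'smoke']

# Same 100k-row frame as in the timings above
df = pd.DataFrame({
    'gender':      [1, 1, 0, 1, 1, 0, 0, 0, 1, 0] * 10000,
    'age':         [13, 45, 1, 45, 15, 16, 16, 16, 15, 15] * 10000,
    'cholesterol': [1, 2, 2, 1, 1, 1, 1, 1, 1, 1] * 10000,
    'smoke':       [0, 0, 1, 1, 7, 8, 3, 4, 4, 2] * 10000,
}, dtype=np.int64)


def via_zip(frame):
    """List comprehension over the zipped key columns, defaulting to 0."""
    cols = [frame[k] for k in keys]
    return frame.assign(score=[score.get(t, 0) for t in zip(*cols)])


def via_merge(frame):
    """Left merge against the dictionary turned into a small lookup frame."""
    lookup = pd.Series(score)
    lookup.index.names = keys
    lookup = lookup.to_frame('score').reset_index()
    return frame.merge(lookup, how='left').fillna(0).astype(int)


# Both approaches should produce the same scores before being compared on speed
assert via_zip(df)['score'].tolist() == via_merge(df)['score'].tolist()

for fn in (via_zip, via_merge):
    # timeit.repeat returns the total time of `number` calls for each repeat
    totals = timeit.repeat(lambda: fn(df), number=3, repeat=3)
    print(f"{fn.__name__}: best {min(totals) / 3 * 1000:.1f} ms per loop")
```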
Alexander answered Nov 09 '22 05:11