I have a dataframe and a dictionary. I need to add a new column to the dataframe and calculate its values based on the dictionary.
The context is machine learning: I am adding a new feature based on a lookup table:
    import numpy as np
    import pandas as pd

    score = {(1, 45, 1, 1): 4, (0, 1, 2, 1): 5}

    df = pd.DataFrame(data={
        'gender':      [1, 1, 0, 1, 1, 0, 0, 0, 1, 0],
        'age':         [13, 45, 1, 45, 15, 16, 16, 16, 15, 15],
        'cholesterol': [1, 2, 2, 1, 1, 1, 1, 1, 1, 1],
        'smoke':       [0, 0, 1, 1, 7, 8, 3, 4, 4, 2]}, dtype=np.int64)

    print(df, '\n')
    df['score'] = 0
    # My attempt: the key here is a tuple of whole columns, not of one row's values
    df.score = score[(df.gender, df.age, df.cholesterol, df.smoke)]
    print(df)
I expect the following output:
       gender  age  cholesterol  smoke  score
    0       1   13            1      0      0
    1       1   45            2      0      0
    2       0    1            2      1      5
    3       1   45            1      1      4
    4       1   15            1      7      0
    5       0   16            1      8      0
    6       0   16            1      3      0
    7       0   16            1      4      0
    8       1   15            1      4      0
    9       0   15            1      2      0
In pandas, you can add a new column to an existing DataFrame with the DataFrame.insert() method, which modifies the DataFrame in place. DataFrame.assign() also adds a new column, but it returns a new DataFrame and leaves the original unchanged.
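As a minimal sketch of the difference between the two methods, the example below uses a small made-up frame (the column values are illustrative, not the question's data). It shows that insert() changes the frame it is called on, while assign() hands back a copy.

```python
import pandas as pd

# Illustrative data only
df = pd.DataFrame({"gender": [1, 0], "age": [45, 1]})

# insert() modifies df in place and returns None; loc=1 makes it the second column
df.insert(loc=1, column="smoke", value=[1, 1])

# assign() leaves df untouched and returns a new DataFrame with the extra column
df_with_score = df.assign(score=[4, 5])

print(list(df.columns))             # ['gender', 'smoke', 'age']
print(list(df_with_score.columns))  # ['gender', 'smoke', 'age', 'score']
```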
To add a new column with a constant value, use the square-bracket index operator and assign that value.
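A short sketch of the constant-value case, again on made-up data: the scalar on the right-hand side is broadcast to every row.

```python
import pandas as pd

# Illustrative data only
df = pd.DataFrame({"gender": [1, 0, 1]})

# A scalar is broadcast to all rows of the new column
df["score"] = 0

print(df["score"].tolist())  # [0, 0, 0]
```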
To copy a column from one DataFrame into another, first extract the column, then add it to the second DataFrame with the join() function. Call join() on the DataFrame that should receive the column and pass the variable holding the extracted column as the argument.
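The join() step can be sketched as follows. The frames `left` and `right` and their values are hypothetical; the point is that a named Series pulled out of one frame can be passed to join() on the other, and rows are matched by index.

```python
import pandas as pd

# Hypothetical frames sharing the default 0..2 index
left = pd.DataFrame({"gender": [1, 0, 1]})
right = pd.DataFrame({"age": [13, 45, 1], "smoke": [0, 1, 1]})

# Extract one column; the resulting Series keeps the name "age"
extracted = right["age"]

# join() aligns on the index and returns a new DataFrame
result = left.join(extracted)

print(list(result.columns))  # ['gender', 'age']
```

Because join() aligns on the index rather than on position, the two frames need compatible indexes for the values to land in the intended rows.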
Since score is a dictionary (so the keys are unique), we can use MultiIndex alignment:
    df = df.set_index(['gender', 'age', 'cholesterol', 'smoke'])
    df['score'] = pd.Series(score)  # Assign values based on the tuple
    # Note: downcast= is deprecated in newer pandas; .fillna(0).astype(int) is an alternative
    df = df.fillna(0, downcast='infer').reset_index()  # Back to columns
       gender  age  cholesterol  smoke  score
    0       1   13            1      0      0
    1       1   45            2      0      0
    2       0    1            2      1      5
    3       1   45            1      1      4
    4       1   15            1      7      0
    5       0   16            1      8      0
    6       0   16            1      3      0
    7       0   16            1      4      0
    8       1   15            1      4      0
    9       0   15            1      2      0
Use assign with a list comprehension that looks up each row, as a tuple of values, in the score dictionary, defaulting to zero if the tuple is not found.
    >>> df.assign(score=[score.get(tuple(row), 0) for row in df.values])
       gender  age  cholesterol  smoke  score
    0       1   13            1      0      0
    1       1   45            2      0      0
    2       0    1            2      1      5
    3       1   45            1      1      4
    4       1   15            1      7      0
    5       0   16            1      8      0
    6       0   16            1      3      0
    7       0   16            1      4      0
    8       1   15            1      4      0
    9       0   15            1      2      0
Timings
Given the variety of approaches, I thought it would be interesting to compare some of the timings.
    # Initial dataframe 100k rows (10 rows of identical data replicated 10k times).
    df = pd.DataFrame(data={
        'gender':      [1, 1, 0, 1, 1, 0, 0, 0, 1, 0] * 10000,
        'age':         [13, 45, 1, 45, 15, 16, 16, 16, 15, 15] * 10000,
        'cholesterol': [1, 2, 2, 1, 1, 1, 1, 1, 1, 1] * 10000,
        'smoke':       [0, 0, 1, 1, 7, 8, 3, 4, 4, 2] * 10000}, dtype=np.int64)

    %timeit -n 10 df.assign(score=[score.get(tuple(v), 0) for v in df.values])
    # 223 ms ± 9.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    %%timeit -n 10
    df.assign(score=[score.get(t, 0) for t in zip(*map(df.get, df))])
    # 76.8 ms ± 2.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    %%timeit -n 10
    df.assign(score=[score.get(v, 0) for v in df.itertuples(index=False)])
    # 113 ms ± 2.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    %timeit -n 10 df.assign(score=df.apply(lambda x: score.get(tuple(x), 0), axis=1))
    # 1.84 s ± 77.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    %%timeit -n 10
    (df
     .set_index(['gender', 'age', 'cholesterol', 'smoke'])
     .assign(score=pd.Series(score))
     .fillna(0, downcast='infer')
     .reset_index()
    )
    # 138 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    %%timeit -n 10
    s = pd.Series(score)
    s.index.names = ['gender', 'age', 'cholesterol', 'smoke']
    df.merge(s.to_frame('score').reset_index(), how='left').fillna(0).astype(int)
    # 24 ms ± 2.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    %%timeit -n 10
    df.assign(score=pd.Series(zip(df.gender, df.age, df.cholesterol, df.smoke))
              .map(score)
              .fillna(0)
              .astype(int))
    # 191 ms ± 7.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    %%timeit -n 10
    df.assign(score=df[['gender', 'age', 'cholesterol', 'smoke']]
              .apply(tuple, axis=1)
              .map(score)
              .fillna(0))
    # 1.95 s ± 134 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
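The merge-based entry is the fastest in these timings (24 ms) but appears only inside a timing cell, so here is a standalone sketch of the same idea on the first four rows of the question's data. The names `keys`, `lookup` and `result` are introduced here for illustration, and `rename_axis` plus `reset_index(name=...)` are used in place of the `s.index.names = ...` and `to_frame` steps from the timed version. Treat it as one possible spelling, not the original author's exact code.

```python
import pandas as pd

score = {(1, 45, 1, 1): 4, (0, 1, 2, 1): 5}
keys = ["gender", "age", "cholesterol", "smoke"]  # illustrative helper name

df = pd.DataFrame({
    "gender":      [1, 1, 0, 1],
    "age":         [13, 45, 1, 45],
    "cholesterol": [1, 2, 2, 1],
    "smoke":       [0, 0, 1, 1],
})

# Turn the dictionary into a lookup table with one column per key component
lookup = pd.Series(score).rename_axis(keys).reset_index(name="score")

# A left merge keeps every row of df; rows without a match get NaN, replaced by 0
result = df.merge(lookup, on=keys, how="left")
result["score"] = result["score"].fillna(0).astype(int)

print(result["score"].tolist())  # [0, 0, 5, 4]
```

Only the score column is filled and cast here, so NaN values or non-integer dtypes in other columns would be left alone, unlike the frame-wide `.fillna(0).astype(int)` in the timed version.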