Pandas, are there any faster ways to update values?

Tags:

python

pandas

Currently, my table has over 10000000 records. It has a column named ID, and I want to update the column named '3rd_col' with a new value if the ID is in a given list.

I use .loc; here is my code:

for _id in given_ids:
    df.loc[df.ID == _id, '3rd_col'] = new_value

But the above code is slow. How can I improve the performance of updating values?

Sorry, to be more specific about my problem: each ID gets a different value, computed by a function, and there are about 4 columns to assign.

for _id in given_ids:
    # pass the loop variable _id, not the builtin id
    df.loc[df.ID == _id, '3rd_col'] = return_new_val_1(_id)
    df.loc[df.ID == _id, '4rd_col'] = return_new_val_2(_id)
    df.loc[df.ID == _id, '5rd_col'] = return_new_val_3(_id)
    df.loc[df.ID == _id, '6rd_col'] = return_new_val_4(_id)
GoingMyWay asked Nov 16 '17 12:11


1 Answer

You can build a dictionary first and then use replace:

import pandas as pd

# sample function
def return_new_val(x):
    return x * 3

given_ids = list('abc')

# compute the new value once per ID
d = {_id: return_new_val(_id) for _id in given_ids}
print (d)
{'a': 'aaa', 'b': 'bbb', 'c': 'ccc'}

df = pd.DataFrame({'ID':list('abdefc'),
                   'M':[4,5,4,5,5,4]})


df['3rd_col'] = df['ID'].replace(d)
print (df)

  ID  M 3rd_col
0  a  4     aaa
1  b  5     bbb
2  d  4       d
3  e  5       e
4  f  5       f
5  c  4     ccc

Or use map, but then IDs with no match become NaN:

df['3rd_col'] = df['ID'].map(d)
print (df)

  ID  M 3rd_col
0  a  4     aaa
1  b  5     bbb
2  d  4     NaN
3  e  5     NaN
4  f  5     NaN
5  c  4     ccc
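If '3rd_col' already exists and the unmatched rows should keep their current values, one option is to combine map with fillna. This is only a minimal sketch: the pre-filled '3rd_col' with the placeholder value 'old' is an assumption made for illustration, not part of the original data.

```python
import pandas as pd

def return_new_val(x):
    # sample function from the answer above
    return x * 3

given_ids = list('abc')
d = {_id: return_new_val(_id) for _id in given_ids}

# hypothetical frame in which '3rd_col' already holds values
df = pd.DataFrame({'ID': list('abdefc'),
                   'M': [4, 5, 4, 5, 5, 4],
                   '3rd_col': ['old'] * 6})

# map gives NaN for IDs missing from d; fillna puts the old value back there
df['3rd_col'] = df['ID'].map(d).fillna(df['3rd_col'])
print(df)
```

Note that this also restores the old value wherever the function itself returns NaN or None, so it fits best when the new values are never missing.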

EDIT:

If you need to add data from multiple functions, first create a new DataFrame and then join it to the original:

def return_new_val1(x):
    return x * 2

def return_new_val2(x):
    return x * 3


given_ids = list('abc')
df2 = pd.DataFrame({'ID':given_ids})
df2['3rd_col'] = df2['ID'].map(return_new_val1)
df2['4rd_col'] = df2['ID'].map(return_new_val2)
df2 = df2.set_index('ID')
print (df2)
   3rd_col 4rd_col
ID                
a       aa     aaa
b       bb     bbb
c       cc     ccc    

df = pd.DataFrame({'ID':list('abdefc'),
                   'M':[4,5,4,5,5,4]})

df = df.join(df2, on='ID')
print (df)
  ID  M 3rd_col 4rd_col
0  a  4      aa     aaa
1  b  5      bb     bbb
2  d  4     NaN     NaN
3  e  5     NaN     NaN
4  f  5     NaN     NaN
5  c  4      cc     ccc

# but replace NaNs with the values in `ID`
cols = ['3rd_col','4rd_col']
df[cols] = df[cols].mask(df[cols].isnull(), df['ID'], axis=0)
print (df)
  ID  M 3rd_col 4rd_col
0  a  4      aa     aaa
1  b  5      bb     bbb
2  d  4       d       d
3  e  5       e       e
4  f  5       f       f
5  c  4      cc     ccc
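If the target columns already exist and only the rows whose ID is in the list should change, a boolean mask from isin may be enough, with no join at all. The sketch below assumes the columns are pre-filled with the placeholder values 'old3' and 'old4', and it overwrites 'M' with 0 just to show the case where every listed ID gets the same value. All of these are illustrative assumptions.

```python
import pandas as pd

def return_new_val1(x):
    return x * 2

def return_new_val2(x):
    return x * 3

given_ids = list('abc')

# hypothetical frame whose target columns already exist
df = pd.DataFrame({'ID': list('abdefc'),
                   'M': [4, 5, 4, 5, 5, 4],
                   '3rd_col': ['old3'] * 6,
                   '4rd_col': ['old4'] * 6})

# one boolean mask for all listed IDs instead of one comparison per ID
mask = df['ID'].isin(given_ids)

# same value for every listed ID: a single assignment replaces the loop
df.loc[mask, 'M'] = 0

# per-ID values: call each function once per ID, then map onto matching rows
funcs = {'3rd_col': return_new_val1, '4rd_col': return_new_val2}
for col, func in funcs.items():
    d = {_id: func(_id) for _id in given_ids}
    df.loc[mask, col] = df.loc[mask, 'ID'].map(d)
print(df)
```

Each function runs once per ID rather than once per row, which matters when the functions are expensive and the table has millions of rows.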
jezrael answered Oct 26 '22 08:10