
Find and replace multiple values in Python

I want to find and replace multiple values in a 1D array / list with new ones.

For example, for the list

a=[2, 3, 2, 5, 4, 4, 1, 2]

I would like to replace

val_old=[1, 2, 3, 4, 5] 

with

val_new=[2, 3, 4, 5, 1]

Therefore the new array is:

a_new=[3, 4, 3, 1, 5, 5, 2, 3]

What is the fastest way to do this for very large lists (e.g. with 50000 values to find and replace)?

Comments on the answers

Thank you to all for a quick response! I checked the proposed solutions with the following:

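import numpy as np
import pandas as pd

N = 10**4
N_val = N // 2  # integer number of distinct values (0.5*N would be a float)
a = np.random.randint(0, N_val, size=N)
val_old = np.arange(N_val, dtype=int)  # np.int was removed in NumPy 1.24; use int
val_new = np.arange(N_val, dtype=int)
np.random.shuffle(val_new)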

a1 = list(a)
val_old1 = list(val_old)
val_new1 = list(val_new)

def Ashwini_Chaudhary(a, val_old, val_new):
    arr = np.empty(a.max()+1, dtype=val_new.dtype)
    arr[val_old] = val_new
    return arr[a]

def EdChum(a, val_old, val_new):
    df = pd.Series(a, dtype=val_new.dtype)
    d = dict(zip(val_old, val_new))
    return df.map(d).values   

def xxyzzy(a, val_old, val_new):
    return [val_new[val_old.index(x)] for x in a]

def Shashank_and_Hackaholic(a, val_old, val_new):
    d = dict(zip(val_old, val_new))
    return [d.get(e, e) for e in a]

def itzmeontv(a, val_old, val_new):
    return [val_new[val_old.index(i)] if i in val_old else i for i in a]

def swenzel(a, val_old, val_new):
    return val_new[np.searchsorted(val_old,a)]

def Divakar(a, val_old, val_new):
    C,R = np.where(a[:,np.newaxis] == val_old[np.newaxis,:])
    a[C] = val_new[R]
    return a

The results:

%timeit -n100 Ashwini_Chaudhary(a, val_old, val_new)
100 loops, best of 3: 77.6 µs per loop

%timeit -n100 swenzel(a, val_old, val_new)
100 loops, best of 3: 703 µs per loop

%timeit -n100 Shashank_and_Hackaholic(a1, val_old1, val_new1)
100 loops, best of 3: 1.7 ms per loop

%timeit -n100 EdChum(a, val_old, val_new)
100 loops, best of 3: 17.6 ms per loop

%timeit -n10 Divakar(a, val_old, val_new)
10 loops, best of 3: 209 ms per loop

%timeit -n10 xxyzzy(a1, val_old1, val_new1)
10 loops, best of 3: 429 ms per loop

%timeit -n10 itzmeontv(a1, val_old1, val_new1)
10 loops, best of 3: 847 ms per loop

The relative difference in performance grows with larger N: for N=10**7, the solution by Ashwini_Chaudhary takes 207 ms and the solution by swenzel takes 6.89 s.

blaz asked Apr 02 '15 08:04


3 Answers

>>> arr = np.empty(a.max() + 1, dtype=val_new.dtype)
>>> arr[val_old] = val_new
>>> arr[a]
array([3, 4, 3, 1, 5, 5, 2, 3])
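
The snippet above builds a lookup table indexed by the old values and then indexes it with a. Because np.empty leaves unassigned slots uninitialized, a value in a that is missing from val_old would come back as arbitrary data. A minimal sketch of a variant that leaves unmapped values unchanged, assuming all values are non-negative integers small enough to use as array indices (the name replace_with_lookup is hypothetical, not part of the answer):

```python
import numpy as np

def replace_with_lookup(a, val_old, val_new):
    """Replace values via a lookup table; unmapped values stay as they are.

    Assumption: a and val_old hold non-negative integers of modest size,
    since the table has max(value) + 1 entries.
    """
    a = np.asarray(a)
    val_old = np.asarray(val_old)
    val_new = np.asarray(val_new)
    # The table must cover every index that is read (a) or written (val_old).
    size = max(a.max(), val_old.max()) + 1
    # Start from the identity mapping so that lookup[x] == x by default.
    lookup = np.arange(size, dtype=np.result_type(a, val_new))
    # Overwrite only the entries that have a replacement.
    lookup[val_old] = val_new
    return lookup[a]

a = np.array([2, 3, 2, 5, 4, 4, 1, 2, 7])  # 7 has no mapping
val_old = np.array([1, 2, 3, 4, 5])
val_new = np.array([2, 3, 4, 5, 1])
print(replace_with_lookup(a, val_old, val_new))
```

The identity table costs one extra pass over the table compared with np.empty, so it trades a little speed for safety when the mapping may be incomplete.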
Ashwini Chaudhary answered Oct 11 '22 06:10


Assuming that your val_old array is sorted (which is the case here; if it is not, sort it and reorder val_new along with it), you can use numpy.searchsorted and then index val_new with the results.
This does not work if a value has no mapping: every value in a must appear in val_old, so provide a one-to-one mapping for all of them.

In [1]: import numpy as np

In [2]: a = np.array([2, 3, 2, 5, 4, 4, 1, 2])

In [3]: old_val = np.array([1, 2, 3, 4, 5])

In [4]: new_val = np.array([2, 3, 4, 5, 1])

In [5]: a_new = np.array([3, 4, 3, 1, 5, 5, 2, 3])

In [6]: i = np.searchsorted(old_val,a)

In [7]: a_replaced = new_val[i]

In [8]: all(a_replaced == a_new)
Out[8]: True

50k numbers? No problem!

In [22]: import time

In [23]: def timed():
   ....:     t0 = time.time()
   ....:     i = np.searchsorted(old_val, a)
   ....:     a_replaced = new_val[i]
   ....:     t1 = time.time()
   ....:     print('%s Seconds' % (t1 - t0))
   ....: 

In [24]: a = np.random.choice(old_val, 50000)

In [25]: timed()
0.00288081169128 Seconds

500k? You won't notice the difference!

In [26]: a = np.random.choice(old_val, 500000)

In [27]: timed()
0.019248008728 Seconds
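
The two restrictions mentioned at the start of this answer (val_old must be sorted, and every value needs a mapping) can be relaxed. A minimal sketch, assuming val_old is non-empty and has no duplicates; the name replace_searchsorted is hypothetical, and it relies on the sorter argument of numpy.searchsorted so that val_old does not have to be pre-sorted:

```python
import numpy as np

def replace_searchsorted(a, val_old, val_new):
    """searchsorted-based replacement for unsorted val_old.

    Values of a that do not occur in val_old are returned unchanged.
    Assumption: val_old is non-empty and contains no duplicates.
    """
    a = np.asarray(a)
    val_old = np.asarray(val_old)
    val_new = np.asarray(val_new)
    # Indices that would sort val_old; val_old itself is left untouched.
    order = np.argsort(val_old)
    # Insertion positions of a within the sorted view of val_old.
    pos = np.searchsorted(val_old, a, sorter=order)
    # Values larger than max(val_old) get position len(val_old); clip them.
    pos = np.clip(pos, 0, len(val_old) - 1)
    # Translate positions in the sorted view back to positions in val_old.
    idx = order[pos]
    # Replace only where the candidate really equals the value in a.
    found = val_old[idx] == a
    return np.where(found, val_new[idx], a)

a = np.array([2, 3, 2, 5, 4, 4, 1, 2, 9, 0])  # 9 and 0 have no mapping
val_old = np.array([3, 1, 2, 5, 4])            # deliberately unsorted
val_new = np.array([4, 2, 3, 1, 5])            # val_new[i] replaces val_old[i]
print(replace_searchsorted(a, val_old, val_new))
```

The extra argsort and equality check add work on top of the plain searchsorted call, so the timings above do not carry over to this variant.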
swenzel answered Oct 11 '22 06:10


In vanilla Python, without the speed of numpy or pandas, this is one way:

a = [2, 3, 2, 5, 4, 4, 1, 2]
val_old = [1, 2, 3, 4, 5]
val_new = [2, 3, 4, 5, 1]
expected_a_new = [3, 4, 3, 1, 5, 5, 2, 3]
d = dict(zip(val_old, val_new))
a_new = [d.get(e, e) for e in a]
print(a_new)  # [3, 4, 3, 1, 5, 5, 2, 3]
print(a_new == expected_a_new)  # True

The average time complexity for this algorithm is O(M + N) where M is the length of your "translation list" and N is the length of list a.
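
To make the O(M + N) claim concrete, here is a rough sketch that compares the dictionary approach with a list.index lookup, which scans val_old for every element and therefore costs O(M * N). The function names and input sizes are illustrative assumptions, and the measured times will vary by machine:

```python
import timeit

def replace_with_dict(a, val_old, val_new):
    # Building the dict is O(M); each lookup is O(1) on average, so the loop is O(N).
    d = dict(zip(val_old, val_new))
    return [d.get(e, e) for e in a]

def replace_with_index(a, val_old, val_new):
    # Both "in" and list.index scan val_old linearly: O(M) per element, O(M * N) total.
    return [val_new[val_old.index(e)] if e in val_old else e for e in a]

val_old = list(range(500))           # M = 500 values to translate
val_new = val_old[1:] + val_old[:1]  # rotate by one so every value changes
a = val_old * 2                      # N = 1000 elements

# Both approaches must agree before their timings are worth comparing.
assert replace_with_dict(a, val_old, val_new) == replace_with_index(a, val_old, val_new)

t_dict = timeit.timeit(lambda: replace_with_dict(a, val_old, val_new), number=5)
t_index = timeit.timeit(lambda: replace_with_index(a, val_old, val_new), number=5)
print("dict: %.4f s, list.index: %.4f s" % (t_dict, t_index))
```

The gap between the two should widen as M grows, since only the list.index version pays for the length of val_old on every element.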

Shashank answered Oct 11 '22 06:10