I have a huge list of 1's and 0's like this :
x = [1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1].
Full list here.
I want to create a new list y with the condition that , the 1's should be preserved only if the they occur in a sequence of >= than 10, else those 1's should be replaced by zeroes.
ex based on x
above ^ , y
should become:
y = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1].
So far I have the following :
import numpy as np
import itertools
nx = np.array(x)
print np.argwhere(np.diff(nx)).squeeze()
answer = []
for key, iter in itertools.groupby(nx):
answer.append((key, len(list(iter))))
print answer
which gives me :
[0 3 8 14] # A
[(1, 1), (0, 3), (1, 5), (0, 6), (1, 10)] # B
#A
which means the changes are happening after the 0th, 3rd and so on positions.
#B
means there is one 1, followed by three 0's followed by five 1's followed by 6 zeroes followed by 10 1's.
How do I proceed to the final step of creating y
where we will be replacing the 1's with 0's depending upon the sequence length?
PS: ##I'm humbled by these brilliant solutions from all the wonderful people.
Just check while you are iterating over the group-by. Something like:
>>> from itertools import groupby
>>> x = [1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1]
>>> result = []
>>> for k, g in groupby(x):
... if k:
... g = list(g)
... if len(g) < 10:
... g = len(g)*[0]
... result.extend(g)
...
>>> result
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Note, this is faster than the corresponding pandas
solution, for a dataset of this size at least:
In [11]: from itertools import groupby
In [12]: %%timeit
...: result = []
...: for k, g in groupby(x):
...: if k:
...: g = list(g)
...: if len(g) < 10:
...: g = len(g)*[0]
...: result.extend(g)
...:
181 µs ± 1.72 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [13]: %%timeit s = pd.Series(x)
...: s[s.groupby(s.ne(1).cumsum()).transform('count').lt(10)] = 0
...:
4.03 ms ± 176 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And note, that's being generous with the pandas solution, not counting any time converting from list to pd.Series
or converting back, including those:
In [14]: %%timeit
...: s = pd.Series(x)
...: s[s.groupby(s.ne(1).cumsum()).transform('count').lt(10)] = 0
...: s = s.tolist()
...:
4.92 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Here is another numpy approach. Please take note of the benchmarks at the bottom of this post:
import numpy as np
import pandas as pd
from itertools import groupby
import re
from timeit import timeit
def f_pp(data):
switches = np.empty((data.size + 1,), bool)
switches[0] = data[0]
switches[-1] = data[-1]
switches[1:-1] = data[:-1]^data[1:]
switches = np.where(switches)[0].reshape(-1, 2)
switches = switches[switches[:, 1]-switches[:, 0] >= 10].ravel()
reps = np.empty((switches.size + 1,), int)
reps[1:-1] = np.diff(switches)
reps[0] = switches[0]
reps[-1] = data.size - switches[-1]
return np.repeat(np.arange(reps.size) & 1, reps)
def f_ja(data):
result = []
for k, g in groupby(data):
if k:
g = list(g)
if len(g) < 10:
g = len(g)*[0]
result.extend(g)
return result
def f_mu(s):
s = s.copy()
s[s.groupby(s.ne(1).cumsum()).transform('count').lt(10)] = 0
return s
def vrange(starts, stops):
stops = np.asarray(stops)
l = stops - starts # Lengths of each range.
return np.repeat(stops - l.cumsum(), l) + np.arange(l.sum())
def f_ka(data):
x = data.copy()
d = np.where(np.diff(x) != 0)[0]
d2 = np.diff(np.concatenate(([0], d, [x.size])))
ind = np.where(d2 >= 10)[0] - 1
x[vrange(d[ind] + 1, d[ind + 1] + 2)] = 0
return x
def f_ol(data):
return list(re.sub(b'(?<!\x01)\x01{,9}(?!\x01)', lambda m: len(m.group()) * b'\x00', bytes(data)))
n = 10_000
data = np.repeat((np.arange(n) + np.random.randint(2))&1, np.random.randint(1, 20, (n,)))
datal = data.tolist()
datap = pd.Series(data)
kwds = dict(globals=globals(), number=100)
print(np.where(f_ja(datal) != f_pp(data))[0])
print(np.where(f_ol(datal) != f_pp(data))[0])
#print(np.where(f_ka(data) != f_pp(data))[0])
print(np.where(f_mu(datap).values != f_pp(data))[0])
print('itertools.groupby: {:6.3f} ms'.format(10 * timeit('f_ja(datal)', **kwds)))
print('re: {:6.3f} ms'.format(10 * timeit('f_ol(datal)', **kwds)))
#print('numpy Kasramvd: {:6.3f} ms'.format(10 * timeit('f_ka(data)', **kwds)))
print('pandas: {:6.3f} ms'.format(10 * timeit('f_mu(datap)', **kwds)))
print('numpy pp: {:6.3f} ms'.format(10 * timeit('f_pp(data)', **kwds)))
Sample output:
[] # Delta ja, pp
[] # Delta ol, pp
[ 749 750 751 ... 98786 98787 98788] # Delta mu, pp
itertools.groupby: 5.415 ms
re: 28.197 ms
pandas: 14.972 ms
numpy pp: 0.788 ms
Note only from scratch solutions considered. @Olivier's @juanpa.arrivillaga's and my approach yield same answer, @MaxU's doesn't. Couldn't get @Kazramvd's to finish reliably. (May be my fault - don't know pandas and didn't fully understand @Kazramvd's solution).
Note that is only one example, other conditions (like shorter lists, more switches, etc.) may change the ranking.
From your encoded list B, you can use list comprehension to generate the new list.
b = [(1, 1), (0, 3), (1, 5), (0, 6), (1, 10)] # B
y = sum(([num and int(rep >= 10)] * rep for num, rep in b), [])
re
Alternatively, from the start this looks like something re
could do as it can work with bytes
.
import re
x = [1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
y = list(re.sub(b'(?<!\x01)\x01{,9}(?!\x01)', lambda m: len(m.group()) * b'\x00', bytes(x)))
Both solutions output:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With