Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Replace 1's with 0's in a sequence

Tags:

python

numpy

I have a huge list of 1's and 0's like this :

x = [1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1].  

Full list here.

I want to create a new list y with the condition that , the 1's should be preserved only if the they occur in a sequence of >= than 10, else those 1's should be replaced by zeroes.
ex based on x above ^ , y should become:

y = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1].  

So far I have the following :

  1. Finding out where the changes are occurring and
  2. Finding out what sequences are occurring with what frequency:

import numpy as np
import itertools
nx = np.array(x)
print np.argwhere(np.diff(nx)).squeeze()

answer = []
for key, iter in itertools.groupby(nx):
    answer.append((key, len(list(iter))))
print answer

which gives me :

[0 3 8 14]  # A
[(1, 1), (0, 3), (1, 5), (0, 6), (1, 10)] # B

#A which means the changes are happening after the 0th, 3rd and so on positions.

#B means there is one 1, followed by three 0's followed by five 1's followed by 6 zeroes followed by 10 1's.

How do I proceed to the final step of creating y where we will be replacing the 1's with 0's depending upon the sequence length?

PS: ##I'm humbled by these brilliant solutions from all the wonderful people.

like image 875
Seirra Avatar asked Mar 06 '18 20:03

Seirra


3 Answers

Just check while you are iterating over the group-by. Something like:

>>> from itertools import groupby
>>> x = [1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1]
>>> result = []
>>> for k, g in groupby(x):
...     if k:
...         g = list(g)
...         if len(g) < 10:
...             g = len(g)*[0]
...     result.extend(g)
...
>>> result
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Note, this is faster than the corresponding pandas solution, for a dataset of this size at least:

In [11]: from itertools import groupby

In [12]: %%timeit
    ...: result = []
    ...: for k, g in groupby(x):
    ...:     if k:
    ...:         g = list(g)
    ...:         if len(g) < 10:
    ...:             g = len(g)*[0]
    ...:     result.extend(g)
    ...:
181 µs ± 1.72 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [13]: %%timeit s = pd.Series(x)
    ...: s[s.groupby(s.ne(1).cumsum()).transform('count').lt(10)] = 0
    ...:
4.03 ms ± 176 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

And note, that's being generous with the pandas solution, not counting any time converting from list to pd.Series or converting back, including those:

In [14]: %%timeit
    ...: s = pd.Series(x)
    ...: s[s.groupby(s.ne(1).cumsum()).transform('count').lt(10)] = 0
    ...: s = s.tolist()
    ...:
4.92 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
like image 198
juanpa.arrivillaga Avatar answered Nov 15 '22 01:11

juanpa.arrivillaga


Here is another numpy approach. Please take note of the benchmarks at the bottom of this post:

import numpy as np
import pandas as pd
from itertools import groupby
import re
from timeit import timeit

def f_pp(data):
    switches = np.empty((data.size + 1,), bool)
    switches[0] = data[0]
    switches[-1] = data[-1]
    switches[1:-1] = data[:-1]^data[1:]
    switches = np.where(switches)[0].reshape(-1, 2)
    switches = switches[switches[:, 1]-switches[:, 0] >= 10].ravel()
    reps = np.empty((switches.size + 1,), int)
    reps[1:-1] = np.diff(switches)
    reps[0] = switches[0]
    reps[-1] = data.size - switches[-1]
    return np.repeat(np.arange(reps.size) & 1, reps)

def f_ja(data):
    result = []
    for k, g in groupby(data):
        if k:
            g = list(g)
            if len(g) < 10:
                g = len(g)*[0]
        result.extend(g)
    return result

def f_mu(s):
    s = s.copy()
    s[s.groupby(s.ne(1).cumsum()).transform('count').lt(10)] = 0
    return s

def vrange(starts, stops):
     stops = np.asarray(stops)
     l = stops - starts # Lengths of each range.
     return np.repeat(stops - l.cumsum(), l) + np.arange(l.sum())

def f_ka(data):
    x = data.copy()
    d = np.where(np.diff(x) != 0)[0]
    d2 = np.diff(np.concatenate(([0], d, [x.size])))
    ind = np.where(d2 >= 10)[0] - 1
    x[vrange(d[ind] + 1, d[ind + 1] + 2)] = 0
    return x

def f_ol(data):
    return list(re.sub(b'(?<!\x01)\x01{,9}(?!\x01)', lambda m: len(m.group()) * b'\x00', bytes(data)))

n = 10_000
data = np.repeat((np.arange(n) + np.random.randint(2))&1, np.random.randint(1, 20, (n,)))
datal = data.tolist()
datap = pd.Series(data)

kwds = dict(globals=globals(), number=100)

print(np.where(f_ja(datal) != f_pp(data))[0])
print(np.where(f_ol(datal) != f_pp(data))[0])
#print(np.where(f_ka(data) != f_pp(data))[0])
print(np.where(f_mu(datap).values != f_pp(data))[0])

print('itertools.groupby: {:6.3f} ms'.format(10 * timeit('f_ja(datal)', **kwds)))
print('re:                {:6.3f} ms'.format(10 * timeit('f_ol(datal)', **kwds)))
#print('numpy Kasramvd:    {:6.3f} ms'.format(10 * timeit('f_ka(data)', **kwds)))
print('pandas:            {:6.3f} ms'.format(10 * timeit('f_mu(datap)', **kwds)))
print('numpy pp:          {:6.3f} ms'.format(10 * timeit('f_pp(data)', **kwds)))

Sample output:

[]                                        # Delta ja, pp
[]                                        # Delta ol, pp
[  749   750   751 ... 98786 98787 98788] # Delta mu, pp
itertools.groupby:  5.415 ms
re:                28.197 ms
pandas:            14.972 ms
numpy pp:           0.788 ms

Note only from scratch solutions considered. @Olivier's @juanpa.arrivillaga's and my approach yield same answer, @MaxU's doesn't. Couldn't get @Kazramvd's to finish reliably. (May be my fault - don't know pandas and didn't fully understand @Kazramvd's solution).

Note that is only one example, other conditions (like shorter lists, more switches, etc.) may change the ranking.

like image 38
Paul Panzer Avatar answered Nov 15 '22 01:11

Paul Panzer


With list comprehension

From your encoded list B, you can use list comprehension to generate the new list.

b = [(1, 1), (0, 3), (1, 5), (0, 6), (1, 10)] # B

y = sum(([num and int(rep >= 10)] * rep for num, rep in b), [])

From the start with re

Alternatively, from the start this looks like something re could do as it can work with bytes.

import re

x = [1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

y = list(re.sub(b'(?<!\x01)\x01{,9}(?!\x01)', lambda m: len(m.group()) * b'\x00', bytes(x)))

Both solutions output:

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
like image 21
Olivier Melançon Avatar answered Nov 14 '22 23:11

Olivier Melançon