Suppose I have a dataframe as follows:
df = pd.DataFrame({'A':[1,1,2,3,3,3,3,3,4,4,4,4,4,4,4,5,5,5,5,6,6]})
df
Out[1]:
A
0 1
1 1
2 2
3 3
4 3
5 3
6 3
7 3
8 4
9 4
10 4
11 4
12 4
13 4
14 4
15 5
16 5
17 5
18 5
19 6
20 6
I'm trying to filter for the numbers that are repeated 4 times or more, so the output would be:
df1
Out[2]:
A
0 3
1 3
2 3
3 3
4 3
5 4
6 4
7 4
8 4
9 4
10 4
11 4
12 5
13 5
14 5
15 5
Right now I'm using itemfreq to extract that information. It returns an array of value/count pairs, which makes it awkward to build a condition and filter down to only those numbers. I think there must be an easier way to do this. Any ideas? Thanks!
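For reference, the itemfreq route looks roughly like this (a sketch of what I'm doing now; note that scipy.stats.itemfreq was later deprecated and removed in favor of np.unique with return_counts=True):

import pandas as pd
from scipy.stats import itemfreq  # deprecated/removed in newer SciPy versions

df = pd.DataFrame({'A': [1,1,2,3,3,3,3,3,4,4,4,4,4,4,4,5,5,5,5,6,6]})
freq = itemfreq(df.A.values)             # 2-column array of [value, count] rows
keep = freq[freq[:, 1] >= 4, 0]          # values occurring 4 times or more
df1 = df[df.A.isin(keep)].reset_index(drop=True)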
groupby.filter is probably the easiest way:
df.groupby('A').filter(lambda x: x.size > 3)
Out:
A
3 3
4 3
5 3
6 3
7 3
8 4
9 4
10 4
11 4
12 4
13 4
14 4
15 5
16 5
17 5
18 5
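If speed matters, a common alternative (not from the original answer) computes the group sizes with transform and filters with a boolean mask, which avoids calling a Python lambda for every group:

df[df.groupby('A')['A'].transform('size') >= 4]

This returns the same rows (with the original index preserved) and is usually faster when there are many groups.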
Approach #1: One NumPy way would be -

a = df.A.values
unq, c = np.unique(a, return_counts=True)   # unique values and their counts
df_out = df[np.in1d(a, unq[c >= 4])]        # keep rows whose value occurs 4+ times
Approach #2: A NumPy + pandas mix one-liner for non-negative integers (np.bincount requires them) -
df[df.A.isin(np.flatnonzero(np.bincount(df.A)>=4))]
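Unpacked with named intermediates (same logic as the one-liner, just spelled out):

counts = np.bincount(df.A)               # counts[v] = number of occurrences of value v
frequent = np.flatnonzero(counts >= 4)   # values seen 4+ times -> array([3, 4, 5])
df_out = df[df.A.isin(frequent)]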
Approach #3: Using the fact that the dataframe is sorted on the relevant column, here's a deeper NumPy approach -
def filter_df(df, N=4):
    a = df.A.values
    # True at every position where a new run of equal values starts (plus both ends)
    mask = np.concatenate(([True], a[1:] != a[:-1], [True]))
    idx = np.flatnonzero(mask)          # start index of each run (and the final end)
    count = idx[1:] - idx[:-1]          # length of each run
    valid_mask = count >= N             # runs long enough to keep
    good_idx = idx[:-1][valid_mask]     # start index of each kept run
    out_arr = np.repeat(a[good_idx], count[valid_mask])
    out_df = pd.DataFrame(out_arr)
    return out_df
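Sample usage on the example dataframe (note the output column gets the default name 0, so rename it if you need 'A' back):

df1 = filter_df(df, N=4)
df1.columns = ['A']   # optional: restore the original column name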
@piRSquared has covered extensive benchmarking of all the approaches posted thus far, and pir1_5 and div3 appear to be the top two there. But the timings seem comparable, which prompted me to take a closer look. That benchmark used timeit(stmt, setup, number=10), i.e. a constant number of iterations, which isn't the most reliable timing method, especially for small datasets. Also, the datasets there seemed small, as the timings for the biggest one were in microseconds. To mitigate those two issues, I propose using IPython's %timeit, which automatically computes the optimal number of iterations for timeit to run, i.e. more iterations for smaller datasets than for bigger ones. This should be more reliable. I also propose including bigger datasets, so that the timings go into milliseconds and seconds. With those couple of changes, the new benchmarking setup looks like this (remember to copy and paste it into an IPython console to run it) -
sizes = [10, 30, 100, 300, 1000, 3000, 10000, 100000, 1000000, 10000000]
timings = np.zeros((len(sizes), 2))
for i, s in enumerate(sizes):
    diffs = np.random.randint(100, size=s)
    d = pd.DataFrame(dict(A=np.arange(s).repeat(diffs)))
    res = %timeit -oq div3(d)
    timings[i, 0] = res.best
    res = %timeit -oq pir1_5(d)
    timings[i, 1] = res.best
timings_df = pd.DataFrame(timings, columns=('div3(sec)', 'pir1_5(sec)'))
timings_df.index = sizes
timings_df.index.name = 'Datasizes'
For completeness, the approaches were -
def pir1_5(d):
    v = d.A.values
    t = np.flatnonzero(v[1:] != v[:-1])   # positions where one run ends and the next starts
    s = np.empty(t.size + 2, int)         # run boundaries, padded at both ends
    s[0] = -1
    s[-1] = v.size - 1
    s[1:-1] = t
    r = np.diff(s)                        # length of each run
    return pd.DataFrame(v[(r > 3).repeat(r)])
def div3(df, N=4):
    a = df.A.values
    mask = np.concatenate(([True], a[1:] != a[:-1], [True]))
    idx = np.flatnonzero(mask)
    count = idx[1:] - idx[:-1]
    valid_mask = count >= N
    good_idx = idx[:-1][valid_mask]
    out_arr = np.repeat(a[good_idx], count[valid_mask])
    return pd.DataFrame(out_arr)
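As a quick sanity check (my addition, not part of the benchmark), both functions agree on the example dataframe from the question:

d = pd.DataFrame({'A': [1,1,2,3,3,3,3,3,4,4,4,4,4,4,4,5,5,5,5,6,6]})
assert div3(d).equals(pir1_5(d))   # same values, index, and dtype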
The timing setup was run in an IPython console (as we are using magic functions). The results looked like this -
In [265]: timings_df
Out[265]:
div3(sec) pir1_5(sec)
Datasizes
10 0.000090 0.000089
30 0.000096 0.000097
100 0.000109 0.000118
300 0.000157 0.000182
1000 0.000303 0.000396
3000 0.000713 0.000998
10000 0.002252 0.003701
100000 0.023257 0.036480
1000000 0.258133 0.398812
10000000 2.603467 3.759063
Thus, the speedup figures for div3 over pir1_5 are:
In [266]: timings_df.iloc[:,1]/timings_df.iloc[:,0]
Out[266]:
Datasizes
10 0.997704
30 1.016446
100 1.077129
300 1.163333
1000 1.304689
3000 1.400464
10000 1.643474
100000 1.568554
1000000 1.544987
10000000 1.443868