With Python, select elements repeated N times or more

Suppose that I have a dataframe as follows:

df = pd.DataFrame({'A':[1,1,2,3,3,3,3,3,4,4,4,4,4,4,4,5,5,5,5,6,6]})

df
Out[1]: 
        A
    0   1
    1   1
    2   2
    3   3
    4   3
    5   3
    6   3
    7   3
    8   4
    9   4
    10  4
    11  4
    12  4
    13  4
    14  4
    15  5
    16  5
    17  5
    18  5
    19  6
    20  6

I'm trying to filter the numbers that are repeated 4 times or more, so that the output would be:

df1
Out[2]:
    A
0   3
1   3
2   3
3   3
4   3
5   4
6   4
7   4
8   4
9   4
10  4
11  4
12  5
13  5
14  5
15  5

Right now I'm using itemfreq to extract that information. This results in a series of arrays, which then makes it complicated to apply a condition and filter only those numbers. I think there must be an easier way to do it. Any ideas? Thanks!
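For context, a minimal sketch of the itemfreq route I mean (scipy.stats.itemfreq is deprecated in newer SciPy releases in favour of np.unique with return_counts=True, and the filtering step here is a reconstruction rather than the exact code):

from scipy.stats import itemfreq   # deprecated; np.unique(..., return_counts=True) replaces it

freq = itemfreq(df.A.values)                      # 2-D array: column 0 = value, column 1 = count
keep = freq[freq[:, 1] >= 4, 0]                   # values occurring 4 times or more
df1 = df[df.A.isin(keep)].reset_index(drop=True)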

asked Sep 01 '17 by Jonathan Pacheco


2 Answers

groupby.filter is probably the easiest way:

df.groupby('A').filter(lambda x: x.size > 3)   # keep groups with more than 3 rows, i.e. 4 or more
Out: 
    A
3   3
4   3
5   3
6   3
7   3
8   4
9   4
10  4
11  4
12  4
13  4
14  4
15  5
16  5
17  5
18  5
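An equivalent idiom that avoids the Python-level lambda is to build a boolean mask with groupby.transform (a sketch of the same filter, not part of the original answer):

df[df.groupby('A')['A'].transform('size') >= 4]

Like groupby.filter, this keeps the original index of the matching rows.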
answered by ayhan

Approach #1 : One of the NumPy ways would be -

a = df.A.values
unq, c = np.unique(a, return_counts=1)   # unique values and their occurrence counts
df_out = df[np.in1d(a, unq[c>=4])]       # keep rows whose value occurs at least 4 times
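In recent NumPy versions, np.isin is the documented replacement for np.in1d, so the same approach can be written as (a minor variant of the code above, assuming NumPy 1.13 or later):

a = df.A.values
unq, c = np.unique(a, return_counts=True)
df_out = df[np.isin(a, unq[c >= 4])]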

Approach #2 : NumPy + Pandas mix one-liner for non-negative integer values -

df[df.A.isin(np.flatnonzero(np.bincount(df.A)>=4))]
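To see why this works on the example data: np.bincount returns, for each non-negative integer value, the number of times it appears, and np.flatnonzero then picks out the values whose count meets the threshold.

np.bincount(df.A)                       # array([0, 2, 1, 5, 7, 4, 2]); index = value, entry = count
np.flatnonzero(np.bincount(df.A) >= 4)  # array([3, 4, 5]), the values kept by isin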

Approach #3 : Using the fact that the dataframe is sorted on the relevant column, here's one deeper NumPy approach -

def filter_df(df, N=4):
    a = df.A.values
    # True at the start of every run of equal values, plus a sentinel at the end
    mask = np.concatenate(( [True], a[1:] != a[:-1], [True] ))
    idx = np.flatnonzero(mask)            # start index of each run (+ end sentinel)
    count = idx[1:] - idx[:-1]            # length of each run

    valid_mask = count >= N               # runs that are long enough to keep
    good_idx = idx[:-1][valid_mask]       # start index of each kept run
    out_arr = np.repeat(a[good_idx], count[valid_mask])
    out_df = pd.DataFrame(out_arr)
    return out_df
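A quick check on the example frame (note that the result gets a default integer column label, since only the raw values are passed to the DataFrame constructor):

filter_df(df).head()
#    0
# 0  3
# 1  3
# 2  3
# 3  3
# 4  3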

Benchmarking

@piRSquared has covered extensive benchmarking of all the approaches posted thus far, and pir1_5 and div3 appear to be the top two there. But the timings seemed comparable, which prompted me to take a closer look. That benchmarking used timeit(stmt, setup, number=10), i.e. a constant number of iterations, which isn't the most reliable way to time code, especially for small datasets. Also, the datasets there seemed small, as the timings for the biggest dataset were in microseconds. To mitigate those two issues, I propose using IPython's %timeit, which automatically chooses the number of iterations for timeit (more iterations for smaller datasets than for bigger ones) and should therefore be more reliable. I also propose including bigger datasets, so that the timings go into milliseconds and seconds. With those couple of changes, the new benchmarking setup looks like this (remember to copy and paste it into an IPython console to run) -

sizes = [10, 30, 100, 300, 1000, 3000, 10000, 100000, 1000000, 10000000]
timings = np.zeros((len(sizes), 2))
for i, s in enumerate(sizes):
    diffs = np.random.randint(100, size=s)                 # random run length for each of s values
    d = pd.DataFrame(dict(A=np.arange(s).repeat(diffs)))   # sorted column built from those runs
    res = %timeit -oq div3(d)
    timings[i, 0] = res.best
    res = %timeit -oq pir1_5(d)
    timings[i, 1] = res.best
timings_df = pd.DataFrame(timings, columns=['div3(sec)', 'pir1_5(sec)'])
timings_df.index = sizes
timings_df.index.name = 'Datasizes'

For completeness, the approaches were -

def pir1_5(d):
    v = d.A.values
    t = np.flatnonzero(v[1:] != v[:-1])   # positions where the value changes
    s = np.empty(t.size + 2, int)
    s[0] = -1                             # sentinel before the first run
    s[-1] = v.size - 1                    # sentinel at the last position
    s[1:-1] = t
    r = np.diff(s)                        # run lengths
    return pd.DataFrame(v[(r > 3).repeat(r)])

def div3(df, N=4):
    a = df.A.values
    mask = np.concatenate(( [True], a[1:] != a[:-1], [True] ))
    idx = np.flatnonzero(mask)
    count = idx[1:] - idx[:-1]

    valid_mask = count>=N
    good_idx = idx[:-1][valid_mask]
    out_arr = np.repeat(a[good_idx], count[valid_mask])
    return pd.DataFrame(out_arr)

The timing setup was run in an IPython console (since we are using magic functions). The results looked like this -

In [265]: timings_df
Out[265]: 
           div3(sec)  pir1_5(sec)
Datasizes                        
10          0.000090     0.000089
30          0.000096     0.000097
100         0.000109     0.000118
300         0.000157     0.000182
1000        0.000303     0.000396
3000        0.000713     0.000998
10000       0.002252     0.003701
100000      0.023257     0.036480
1000000     0.258133     0.398812
10000000    2.603467     3.759063

Thus, the speedup figures for div3 over pir1_5 are:

In [266]: timings_df.iloc[:,1]/timings_df.iloc[:,0]
Out[266]: 
Datasizes
10          0.997704
30          1.016446
100         1.077129
300         1.163333
1000        1.304689
3000        1.400464
10000       1.643474
100000      1.568554
1000000     1.544987
10000000    1.443868
answered by Divakar