Suppose I have a dataframe as follows:
df = pd.DataFrame({'A':[1,1,2,3,3,3,3,3,4,4,4,4,4,4,4,5,5,5,5,6,6]})
df
Out[1]:
A
0 1
1 1
2 2
3 3
4 3
5 3
6 3
7 3
8 4
9 4
10 4
11 4
12 4
13 4
14 4
15 5
16 5
17 5
18 5
19 6
20 6
I'm trying to filter for the numbers that are repeated 4 times or more, so the output would be:
df1
Out[2]:
A
0 3
1 3
2 3
3 3
4 3
5 4
6 4
7 4
8 4
9 4
10 4
11 4
12 5
13 5
14 5
15 5
Right now I'm using itemfreq to extract that information. It returns an array of value/count pairs, which makes it awkward to build a condition and filter down to only those numbers. I think there must be an easier way to do this. Any ideas? Thanks!
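For reference, the itemfreq route looks roughly like this (a sketch of what I'm doing now; note that scipy.stats.itemfreq was later deprecated and removed in favor of np.unique with return_counts=True):

import pandas as pd
from scipy.stats import itemfreq  # deprecated/removed in newer SciPy versions

df = pd.DataFrame({'A': [1,1,2,3,3,3,3,3,4,4,4,4,4,4,4,5,5,5,5,6,6]})
freq = itemfreq(df.A.values)             # 2-column array of [value, count] rows
keep = freq[freq[:, 1] >= 4, 0]          # values occurring 4 times or more
df1 = df[df.A.isin(keep)].reset_index(drop=True)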
groupby.filter is probably the easiest way:
df.groupby('A').filter(lambda x: x.size > 3)
Out:
A
3 3
4 3
5 3
6 3
7 3
8 4
9 4
10 4
11 4
12 4
13 4
14 4
15 5
16 5
17 5
18 5
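If speed matters, a common alternative (not from the original answer) computes the group sizes with transform and filters with a boolean mask, which avoids calling a Python lambda for every group:

df[df.groupby('A')['A'].transform('size') >= 4]

This returns the same rows (with the original index preserved) and is usually faster when there are many groups.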
Approach #1: One NumPy way would be -

a = df.A.values
unq, c = np.unique(a, return_counts=True)   # unique values and their counts
df_out = df[np.in1d(a, unq[c >= 4])]        # keep rows whose value occurs 4+ times
Approach #2: A NumPy + pandas mix one-liner for non-negative integers (np.bincount requires them) -
df[df.A.isin(np.flatnonzero(np.bincount(df.A)>=4))]
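Unpacked with named intermediates (same logic as the one-liner, just spelled out):

counts = np.bincount(df.A)               # counts[v] = number of occurrences of value v
frequent = np.flatnonzero(counts >= 4)   # values seen 4+ times -> array([3, 4, 5])
df_out = df[df.A.isin(frequent)]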
Approach #3: Using the fact that the dataframe is sorted on the relevant column, here's a deeper NumPy approach -
def filter_df(df, N=4):
    a = df.A.values
    # True at every position where a new run of equal values starts (plus both ends)
    mask = np.concatenate(([True], a[1:] != a[:-1], [True]))
    idx = np.flatnonzero(mask)          # start index of each run (and the final end)
    count = idx[1:] - idx[:-1]          # length of each run
    valid_mask = count >= N             # runs long enough to keep
    good_idx = idx[:-1][valid_mask]     # start index of each kept run
    out_arr = np.repeat(a[good_idx], count[valid_mask])
    out_df = pd.DataFrame(out_arr)
    return out_df
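Sample usage on the example dataframe (note the output column gets the default name 0, so rename it if you need 'A' back):

df1 = filter_df(df, N=4)
df1.columns = ['A']   # optional: restore the original column name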
@piRSquared has covered extensive benchmarking of all the approaches posted thus far, and pir1_5 and div3 appear to be the top two there. But the timings seem comparable, which prompted me to take a closer look. That benchmark used timeit(stmt, setup, number=10), i.e. a constant number of iterations, which isn't the most reliable timing method, especially for small datasets. Also, the datasets there seemed small, as the timings for the biggest one were in microseconds. To mitigate those two issues, I propose using IPython's %timeit, which automatically computes the optimal number of iterations for timeit to run, i.e. more iterations for smaller datasets than for bigger ones. This should be more reliable. I also propose including bigger datasets, so that the timings go into milliseconds and seconds. With those couple of changes, the new benchmarking setup looks like this (remember to copy and paste it into an IPython console to run it) -
sizes = [10, 30, 100, 300, 1000, 3000, 10000, 100000, 1000000, 10000000]
timings = np.zeros((len(sizes), 2))
for i, s in enumerate(sizes):
    diffs = np.random.randint(100, size=s)
    d = pd.DataFrame(dict(A=np.arange(s).repeat(diffs)))
    res = %timeit -oq div3(d)
    timings[i, 0] = res.best
    res = %timeit -oq pir1_5(d)
    timings[i, 1] = res.best
timings_df = pd.DataFrame(timings, columns=('div3(sec)', 'pir1_5(sec)'))
timings_df.index = sizes
timings_df.index.name = 'Datasizes'
For completeness, the approaches were -
def pir1_5(d):
    v = d.A.values
    t = np.flatnonzero(v[1:] != v[:-1])   # positions where one run ends and the next starts
    s = np.empty(t.size + 2, int)         # run boundaries, padded at both ends
    s[0] = -1
    s[-1] = v.size - 1
    s[1:-1] = t
    r = np.diff(s)                        # length of each run
    return pd.DataFrame(v[(r > 3).repeat(r)])
def div3(df, N=4):
    a = df.A.values
    mask = np.concatenate(([True], a[1:] != a[:-1], [True]))
    idx = np.flatnonzero(mask)
    count = idx[1:] - idx[:-1]
    valid_mask = count >= N
    good_idx = idx[:-1][valid_mask]
    out_arr = np.repeat(a[good_idx], count[valid_mask])
    return pd.DataFrame(out_arr)
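As a quick sanity check (my addition, not part of the benchmark), both functions agree on the example dataframe from the question:

d = pd.DataFrame({'A': [1,1,2,3,3,3,3,3,4,4,4,4,4,4,4,5,5,5,5,6,6]})
assert div3(d).equals(pir1_5(d))   # same values, index, and dtype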
The timing setup was run in an IPython console (as we are using magic functions). The results looked like this -
In [265]: timings_df
Out[265]:
div3(sec) pir1_5(sec)
Datasizes
10 0.000090 0.000089
30 0.000096 0.000097
100 0.000109 0.000118
300 0.000157 0.000182
1000 0.000303 0.000396
3000 0.000713 0.000998
10000 0.002252 0.003701
100000 0.023257 0.036480
1000000 0.258133 0.398812
10000000 2.603467 3.759063
Thus, the speedup figures for div3 over pir1_5 are:
In [266]: timings_df.iloc[:,1]/timings_df.iloc[:,0]
Out[266]:
Datasizes
10 0.997704
30 1.016446
100 1.077129
300 1.163333
1000 1.304689
3000 1.400464
10000 1.643474
100000 1.568554
1000000 1.544987
10000000 1.443868