Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to filter data from a data frame when the number of columns are dynamic?

I have a data frame like below

    A_Name  B_Detail  Value_B  Value_C   Value_D ......
0   AA      X1        1.2      0.5       -1.3    ......
1   BB      Y1        0.76     -0.7      0.8     ......
2   CC      Z1        0.7      -1.3      2.5     ......
3   DD      L1        0.9      -0.5      0.4     ......
4   EE      M1        1.3      1.8       -1.3    ......
5   FF      N1        0.7      -0.8      0.9     ......
6   GG      K1        -2.4     -1.9      2.1     ......

This is just a sample of data frame, I can have n number of columns like (Value_A, Value_B, Value_C, ........... Value_N)

Now i want to filter all rows where absolute value of all columns (Value_A, Value_B, Value_C, ....) is less than 1.

If you have limited number of columns, you can filter the data by simply putting 'and' condition on columns in dataframe, but I am not able to figure out what to do in this case.

I don't know what would be number of such columns, the only thing I know that such columns would be prefixed with 'Value'.

In above case output should be like

    A_Name  B_Detail  Value_B  Value_C   Value_D ......
1   BB      Y1        0.76     -0.7      0.8     ......
3   DD      L1        0.9      -0.5      0.4     ......
5   FF      N1        0.7      -0.8      0.9     ......
like image 228
Sudheer1990 Avatar asked Jul 26 '16 06:07

Sudheer1990


1 Answers

Use filter with abs and all for creating mask and then boolean indexing:

mask = (df.filter(like='Value').abs() < 1).all(axis=1)
print (mask)
0    False
1     True
2    False
3     True
4    False
5     True
6    False
dtype: bool

print (df[mask])
  A_Name B_Detail  Value_B  Value_C  Value_D
1     BB       Y1     0.76     -0.7      0.8
3     DD       L1     0.90     -0.5      0.4
5     FF       N1     0.70     -0.8      0.9

All combination in timings:

#len df = 70k, 5 columns
df = pd.concat([df]*10000).reset_index(drop=True)

In [47]: %timeit (df[(df.filter(like='Value').abs() < 1).all(axis=1)])
100 loops, best of 3: 7.48 ms per loop

In [48]: %timeit (df[df.filter(regex=r'Value').abs().lt(1).all(1)])
100 loops, best of 3: 7.02 ms per loop

In [49]: %timeit (df[df.filter(like='Value').abs().lt(1).all(1)])
100 loops, best of 3: 7.02 ms per loop

In [50]: %timeit (df[(df.filter(regex=r'Value').abs() < 1).all(axis=1)])
100 loops, best of 3: 7.3 ms per loop

#len df = 70k, 5k columns
df = pd.concat([df]*10000).reset_index(drop=True)
df = pd.concat([df]*1000, axis=1)
#only for testing, create unique columns names
df.columns = df.columns.str[:-1] + [str(col) for col in list(range(df.shape[1]))]
print (df)

In [75]: %timeit ((df[(df.filter(like='Value').abs() < 1).all(axis=1)]))
1 loop, best of 3: 10.3 s per loop

In [76]: %timeit ((df[(df.filter(regex=r'Value').abs() < 1).all(axis=1)]))
1 loop, best of 3: 10.3 s per loop

In [77]: %timeit (df[df.filter(regex=r'Value').abs().lt(1).all(1)])
1 loop, best of 3: 10.4 s per loop

In [78]: %timeit (df[df.filter(like='Value').abs().lt(1).all(1)])
1 loop, best of 3: 10.1 s per loop
like image 116
jezrael Avatar answered Oct 03 '22 09:10

jezrael