Consider this DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((2, 3)), columns=list('abc'))
col_list = list('bcd')
df[col_list]
raises a KeyError, because 'd' is not one of the columns:
KeyError: "['d'] not in index"
How do I select as many of these columns as possible, skipping the ones that don't exist?
What about using Index.intersection()?
In [69]: df[df.columns.intersection(col_list)]
Out[69]:
     b    c
0  1.0  1.0
1  1.0  1.0
In [70]: df.columns
Out[70]: Index(['a', 'b', 'c'], dtype='object') # <---------- Index
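For reference, here is the same selection as a plain, self-contained script rather than an IPython session (a minimal sketch; the commented output is what pandas produces for this frame):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((2, 3)), columns=list('abc'))
col_list = list('bcd')

# Keep only the labels from col_list that are actually present in df.columns.
present = df.columns.intersection(col_list)

print(df[present])
#      b    c
# 0  1.0  1.0
# 1  1.0  1.0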
Timing:
In [21]: df_ = pd.concat([df] * 10**5, ignore_index=True)
In [22]: df_.shape
Out[22]: (200000, 3)
In [23]: df.columns
Out[23]: Index(['a', 'b', 'c'], dtype='object')
In [24]: col_list = list('bcd')
In [28]: %timeit df_[df_.columns.intersection(col_list)]
100 loops, best of 3: 6.24 ms per loop
In [29]: %timeit df_[[col for col in col_list if col in df_.columns]]
100 loops, best of 3: 5.69 ms per loop
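Outside IPython, the same comparison can be reproduced with the standard timeit module; a rough sketch (absolute numbers will differ across pandas versions and machines):

import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((2, 3)), columns=list('abc'))
df_ = pd.concat([df] * 10**5, ignore_index=True)  # 200000 rows x 3 columns
col_list = list('bcd')

# Index.intersection vs. a plain list comprehension over the requested labels.
t_inter = timeit.timeit(lambda: df_[df_.columns.intersection(col_list)], number=100)
t_comp = timeit.timeit(lambda: df_[[c for c in col_list if c in df_.columns]], number=100)

print('intersection: %.2f ms per call' % (t_inter / 100 * 1000))
print('list comp:    %.2f ms per call' % (t_comp / 100 * 1000))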
Let's test it on a transposed DataFrame (3 rows, 200K columns):
In [30]: t = df_.T
In [31]: t.shape
Out[31]: (3, 200000)
In [32]: t
Out[32]:
       0    1    2    3    4  ...  199995  199996  199997  199998  199999
a    1.0  1.0  1.0  1.0  1.0  ...     1.0     1.0     1.0     1.0     1.0
b    1.0  1.0  1.0  1.0  1.0  ...     1.0     1.0     1.0     1.0     1.0
c    1.0  1.0  1.0  1.0  1.0  ...     1.0     1.0     1.0     1.0     1.0
[3 rows x 200000 columns]
In [33]: col_list=[-10, -20, 10, 20, 100]
In [34]: %timeit t[t.columns.intersection(col_list)]
10 loops, best of 3: 52.8 ms per loop
In [35]: %timeit t[[col for col in col_list if col in t.columns]]
10 loops, best of 3: 103 ms per loop
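To make explicit what the intersection returns in this case: the labels -10 and -20 are simply dropped, and only the existing integer columns are selected. A small sketch of the same setup:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((2, 3)), columns=list('abc'))
t = pd.concat([df] * 10**5, ignore_index=True).T  # 3 rows x 200000 integer columns

col_list = [-10, -20, 10, 20, 100]
present = t.columns.intersection(col_list)

print(list(present))     # [10, 20, 100] -- the missing labels are silently dropped
print(t[present].shape)  # (3, 3)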
Conclusion: the list comprehension wins (slightly) for DataFrames with few columns, while the Pandas approach (Index.intersection) wins for bigger data sets with many columns.