I am new to python so sorry if this is too obvious.
I have a dataframe that looks like below:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 10))
df.columns = ['date1', 'date2', 'date3', 'name1', 'col1', 'col2', 'col3', 'name2', 'date4', 'date5']
date1 date2 date3 name1 col1 col2 col3 \
0 -0.177090 0.417442 -0.930226 0.460750 1.062997 0.534942 -1.082967
1 -0.942154 0.047837 -0.494979 2.437469 -0.446984 0.709556 -0.135978
2 -1.544783 0.129307 -0.169556 -0.890697 2.650924 0.976610 0.290226
3 -0.651220 -0.196342 0.712601 0.641927 -0.009921 -0.038450 0.498087
4 -0.299145 -1.407747 1.914364 0.554330 -0.196702 2.037057 -0.287942
name2 date4 date5
0 -0.318310 0.358619 -0.243150
1 1.171024 0.277943 -1.584723
2 -0.546707 -1.951831 0.678125
3 -0.510261 -0.018574 -0.212684
4 1.929841 0.995625 -1.125044
I'd like to to keep all columns that have, for example, 'date' in their names. That is, I want to keep columns 'date1', 'date2', 'date3', 'date4', 'date5', etc. In some statistical packages I can use * to represent all possible characters and use a command like this:
keep date*
Is there an equivalent way of doing this in python?
Thanks very much for any help.
You can use the filter method. To do the equivalent of keep date*:
In [62]: df.filter(like='date')
Out[62]:
date1 date2 date3 date4 date5
0 0.091744 -0.431606 1.280286 0.263137 0.444550
1 0.688155 0.583918 0.957041 0.446047 1.654274
2 0.109004 0.608818 0.091498 0.940406 0.476479
3 -0.874016 1.312567 0.326480 1.213292 0.504049
4 -0.203515 -0.979086 0.458790 1.012296 -2.446310
The filter method has also a regex keyword, to do some more complex filtering.
Eg to drop all dates, you can provide a regex expression that says to not match a certain string: df.filter(regex="^(?!date).*$")
In the upcoming pandas (0.14), this functionality will also be provided in drop method, so this will be easier.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With