I have a large dataframe (5000 x 12039) and I want to get the column name that matches a numpy array.
For example, if I have the table
m1lenhr m1lenmin m1citywt m1a12a cm1age cm1numb m1b1a m1b1b m1b12a m1b12b ... kind_attention_scale_10 kind_attention_scale_22 kind_attention_scale_21 kind_attention_scale_15 kind_attention_scale_18 kind_attention_scale_19 kind_attention_scale_25 kind_attention_scale_24 kind_attention_scale_27 kind_attention_scale_23
challengeID
1 0.130765 40.0 202.485367 1.893256 27.0 1.0 2.0 0.0 2.254198 2.289966 ... 0 0 0 0 0 0 0 0 0 0
2 0.000000 40.0 45.608219 1.000000 24.0 1.0 2.0 0.0 2.000000 3.000000 ... 0 0 0 0 0 0 0 0 0 0
3 0.000000 35.0 39.060299 2.000000 23.0 1.0 2.0 0.0 2.254198 2.289966 ... 0 0 0 0 0 0 0 0 0 0
4 0.000000 30.0 22.304855 1.893256 22.0 1.0 3.0 0.0 2.000000 3.000000 ... 0 0 0 0 0 0 0 0 0 0
5 0.000000 25.0 35.518272 1.893256 19.0 1.0 1.0 6.0 1.000000 3.000000 ... 0
I want to do this:
x = [40.0, 40.0, 35.0, 30.0, 25.0]
find_column(x)
and have find_column(x)
return m1lenmin
The columns attribute of a dataframe returns its column labels. You can get the column names as a NumPy array through the df.columns.values property.
To access the column names of a pandas dataframe, use the columns attribute (it is an attribute, not a method, so there are no parentheses). For example, if the dataframe is called df, print(df.columns) shows all of its column labels. From there you can select particular columns, rename a column, and so on.
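As a small illustration of the column-label access described above, here is a sketch on a made-up two-column frame. The data and the new label "minutes" are hypothetical; only the column names are borrowed from the question.

```python
import pandas as pd

# Hypothetical toy frame; only the column names come from the question.
df = pd.DataFrame({"m1lenhr": [0.0, 0.0], "m1lenmin": [40.0, 35.0]})

labels = df.columns          # a pandas Index holding the column labels
names = df.columns.values    # the same labels as a NumPy array

# Renaming returns a new frame and leaves df unchanged.
renamed = df.rename(columns={"m1lenmin": "minutes"})

print(list(labels))
print(list(renamed.columns))
```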
Approach #1
Here's one vectorized approach leveraging NumPy broadcasting:
df.columns[(df.values == np.asarray(x)[:,None]).all(0)]
Sample run -
In [367]: df
Out[367]:
0 1 2 3 4 5 6 7 8 9
0 7 1 2 6 2 1 7 2 0 6
1 5 4 3 3 2 1 1 1 5 5
2 7 7 2 2 5 4 6 6 5 7
3 0 5 4 1 5 7 8 2 2 4
4 7 1 0 4 5 4 3 2 8 6
In [368]: x = df.iloc[:,2].values.tolist()
In [369]: x
Out[369]: [2, 3, 2, 4, 0]
In [370]: df.columns[(df.values == np.asarray(x)[:,None]).all(0)]
Out[370]: Int64Index([2], dtype='int64')
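To connect this back to the question, the broadcasting expression can be wrapped in a function. This is only a sketch: the name find_column comes from the question, but taking the dataframe as a second argument is an assumption made here, and the three-column frame is a hand-typed excerpt of the question's table.

```python
import numpy as np
import pandas as pd

def find_column(df, x):
    """Return the labels of all columns whose values equal x element-wise."""
    x = np.asarray(x)
    # x[:, None] has shape (n_rows, 1), so it broadcasts against the
    # (n_rows, n_cols) value array and compares x with every column.
    mask = (df.to_numpy() == x[:, None]).all(axis=0)
    return df.columns[mask]

# Hand-typed excerpt of the question's table (assumed values).
df = pd.DataFrame({
    "m1lenhr": [0.130765, 0.0, 0.0, 0.0, 0.0],
    "m1lenmin": [40.0, 40.0, 35.0, 30.0, 25.0],
    "cm1age": [27.0, 24.0, 23.0, 22.0, 19.0],
})

result = find_column(df, [40.0, 40.0, 35.0, 30.0, 25.0])
print(list(result))
```

The function returns an Index rather than a single label, because more than one column can match, or none at all.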
Approach #2
Alternatively, here's another using the concept of views:
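import numpy as np

def view1D(a, b):  # a, b are 2D arrays with the same dtype and row length
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    # View each row as a single opaque (void) element spanning the whole row
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(), b.view(void_dt).ravel()

# Transpose so that each dataframe column becomes one row of the array
df1D_arr, x1D = view1D(df.values.T, np.asarray(x)[None])
out = np.flatnonzero(df1D_arr == x1D)  # positions of the matching columns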
Sample run -
In [442]: df
Out[442]:
0 1 2 3 4 5 6 7 8 9
0 7 1 2 6 2 1 7 2 0 6
1 5 4 3 3 2 1 1 1 5 5
2 7 7 2 2 5 4 6 6 5 7
3 0 5 4 1 5 7 8 2 2 4
4 7 1 0 4 5 4 3 2 8 6
In [443]: x = df.iloc[:,5].values.tolist()
In [444]: df1D_arr, x1D = view1D(df.values.T,np.asarray(x)[None])
In [445]: np.flatnonzero(df1D_arr==x1D)
Out[445]: array([5])
Try this:
In [91]: x = np.array(x)
In [94]: df.apply(lambda col: col.eq(x).all())
Out[94]:
m1lenhr False
m1lenmin True
m1citywt False
m1a12a False
cm1age False
cm1numb False
m1b1a False
m1b1b False
m1b12a False
m1b12b False
dtype: bool
In [95]: df.columns[df.apply(lambda col: col.eq(x).all()).values]
Out[95]: Index(['m1lenmin'], dtype='object')
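One caveat for float data like the table in the question: all of the approaches above test exact equality, so a value typed in from rounded printed output may not match. The following sketch swaps in np.isclose as a tolerance-based alternative. The function name find_column_close, its default tolerances, and the toy data are assumptions made for illustration, not part of the original answers.

```python
import numpy as np
import pandas as pd

def find_column_close(df, x, rtol=1e-5, atol=1e-8):
    """Return labels of numeric columns that match x within a tolerance."""
    x = np.asarray(x, dtype=float)
    # Restrict to numeric columns so the float comparison is well defined.
    numeric = df.select_dtypes(include="number")
    values = numeric.to_numpy(dtype=float)
    mask = np.isclose(values, x[:, None], rtol=rtol, atol=atol).all(axis=0)
    return numeric.columns[mask]

# Toy data: 0.1 + 0.2 is not exactly equal to 0.3 in floating point.
df = pd.DataFrame({"a": [0.1 + 0.2, 1.0], "b": [0.3, 2.0]})
x = [0.3, 1.0]

exact = df.columns[df.apply(lambda col: col.eq(x).all()).values]
close = find_column_close(df, x)
print(list(exact), list(close))
```

Here the exact comparison finds no column, while the tolerant version reports "a". Whether a tolerance is appropriate depends on the data, so treat the defaults as a starting point.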