
How to select DataFrame columns based on partial matching?

Tags:

python

pandas

I was struggling this afternoon to find a way of selecting a few columns of my Pandas DataFrame, by checking the occurrence of a certain pattern in their name (label?).

I had been looking for something like contains or isin for ndarrays / pd.Series, but had no luck.

This frustrated me quite a bit, since I was already checking a column of my DataFrame for occurrences of specific string patterns, as in:

hp = ~(df.target_column.str.contains('some_text') | df.target_column.str.contains('other_text'))
df_cln= df[hp]

However, no matter how I banged my head, I could not apply .str.contains() to the object returned by df.columns, which is an Index, nor to the one returned by df.columns.values, which is an ndarray. It works fine on what is returned by the "slicing" operation df[column_name], i.e. a Series, though.

My first solution involved a for loop and the creation of a help list:

ll = []
for a in df.columns:
    if a.startswith('start_exp1') or a.startswith('start_exp2'):
        ll.append(a)
df[ll]

(one could apply any of the str functions, of course)
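
For completeness, the same check can also be written as a list comprehension. This is just a sketch, assuming the same df and prefixes as above; note that str.startswith accepts a tuple of prefixes, so the two checks can be combined:

# select all columns whose name starts with either prefix
ll = [a for a in df.columns if a.startswith(('start_exp1', 'start_exp2'))]
df[ll]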

Then, I found the map function and got it to work with the following code:

import re
sel = df.columns.map(lambda x: bool(re.search('your_regex', x)))
df[df.columns[sel]]

Of course, in the first solution I could have performed the same kind of regex check, since I can apply it to the plain str objects yielded by the iteration.

I am very new to Python and never really programmed anything so I am not too familiar with speed/timing/efficiency, but I tend to think that the second method - using a map - could potentially be faster, besides looking more elegant to my untrained eye.

I am curious to know what you think of it, and what possible alternatives there would be. Given my level of noobness, I would really appreciate it if you could correct any mistakes I may have made in the code and point me in the right direction.

Thanks, Michele

EDIT: I just found the Index method Index.to_series(), which returns, ehm, a Series to which I could apply .str.contains('whatever'). However, this is not quite as powerful as a true regex, and I could not find a way of passing the result of Index.to_series().str to the re.search() function.

Asked Jul 21 '15 by Michele Ancis


3 Answers

Selecting columns by partial string match can simply be done via:

df.filter(like='hello')  # select columns which contain the word hello

And to select rows by partial string match, you can pass axis=0 to filter:

df.filter(like='hello', axis=0) 
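
filter also takes a regex argument if a plain substring is not enough; a quick sketch (the patterns here are just examples):

df.filter(regex='hello$')           # columns whose name ends with 'hello'
df.filter(regex='^hello', axis=0)   # rows whose index label starts with 'hello'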

Answered by Philipp Schwarz

Your solution using map is very good. If you really want to use str.contains, it is possible to convert Index objects to Series (which have the str.contains method):

In [1]: df
Out[1]: 
   x  y  z
0  0  0  0
1  1  1  1
2  2  2  2
3  3  3  3
4  4  4  4
5  5  5  5
6  6  6  6
7  7  7  7
8  8  8  8
9  9  9  9

In [2]: df.columns.to_series().str.contains('x')
Out[2]: 
x     True
y    False
z    False
dtype: bool

In [3]: df[df.columns[df.columns.to_series().str.contains('x')]]
Out[3]: 
   x
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9

UPDATE: I just read your last paragraph. According to the documentation, str.contains treats the pattern as a regex by default (e.g. str.contains('^myregex')).
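
So, to stay with the example above, a regex-driven column selection could look like this (a sketch; '^x' is just an example pattern):

mask = df.columns.to_series().str.contains('^x')  # pattern is treated as a regex
df.loc[:, mask]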

Answered by Robert Smith

I think df.keys().tolist() is the thing you're searching for.

A tiny example:

import pandas as pd

d = pd.DataFrame({'somename': [1, 2, 3], 'othername': [4, 5, 6]})

names = d.keys().tolist()

for n in names:
    print(n)        # the column name
    print(type(n))  # plain Python strings

Output:

somename
<class 'str'>
othername
<class 'str'>

Then with the strings you got, you can do any string operation you want.
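
For instance, a sketch of feeding those names back into a column selection (assuming the same d and names as above; 'name' is just an example substring):

cols = [n for n in names if 'name' in n]  # keep column names containing 'name'
d[cols]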

Answered by Geeocode