
Search and filter pandas dataframe with regular expressions

I'd appreciate your help. I have a pandas dataframe. I want to search 3 columns of the dataframe using a regular expression, then return all rows that meet the search criteria, sorted by one of my columns. I would like to write this as a function so I can implement this logic with other criteria if possible, but am not quite sure how to do this.

For example, I know how to pull the results of a search thusly (with col1 being a column name):

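import re

idx1 = df.col1.str.contains(r'vhigh|high', flags=re.IGNORECASE, regex=True, na=False)

# Note: ~ inverts the mask, so this prints the rows that do NOT match; df[idx1] gives the matches
print(df[~idx1])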

but I can't figure out how to take this type of action, and perform it with multiple columns and then sort. Anyone have any tips?

Daina asked Sep 16 '15 16:09




1 Answer

You can use apply to make the code more concise. For example, given this DataFrame:

import pandas as pd

df = pd.DataFrame(
    {
        'col1': ['vhigh', 'low', 'vlow'],
        'col2': ['eee', 'low', 'high'],
        'val': [100, 200, 300]
    }
)
print(df)

Input:

    col1  col2  val
0  vhigh   eee  100
1    low   low  200
2   vlow  high  300

You can select all the rows that contain the strings vhigh or high in columns col1 or col2 as follows:

mask = df[['col1', 'col2']].apply(
    lambda x: x.str.contains(
        'vhigh|high',
        regex=True
    )
).any(axis=1)
print(df[mask])

apply calls str.contains on each of the selected columns (apply works column by column, since axis=0 is the default), producing a Boolean DataFrame with the same shape as the selection. any(axis=1) then reduces each row to a single Boolean that is True if at least one of the columns met the search criteria. The resulting mask can be used to select rows from the original DataFrame.

Output:

    col1  col2  val
0  vhigh   eee  100
2   vlow  high  300
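To see why rows 0 and 2 are selected, it may help to look at the intermediate per-column result before any(axis=1) collapses it. The following is a small sketch using the same example DataFrame; the name per_column is only illustrative and does not appear in the code above.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'col1': ['vhigh', 'low', 'vlow'],
        'col2': ['eee', 'low', 'high'],
        'val': [100, 200, 300]
    }
)

# One Boolean column per searched column: True where the pattern matched.
# per_column is an illustrative name, not part of the original answer.
per_column = df[['col1', 'col2']].apply(
    lambda x: x.str.contains('vhigh|high', regex=True)
)

# Collapse across columns: True where at least one searched column matched.
mask = per_column.any(axis=1)

print(per_column)
print(mask)
```

Row 0 matches through col1 and row 2 matches through col2, while row 1 matches in neither column and is dropped.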

Then, to sort the result by a column, e.g. the val column, use sort_values (the older DataFrame.sort method has since been removed from pandas):

df[mask].sort_values('val')
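Since the question asks for a reusable function, the steps above could be wrapped up roughly as follows. This is only a sketch: the function name search_rows and its parameters are made up for illustration, and it assumes every searched column holds strings (the .str accessor raises an error on numeric columns). It also passes na=False and an optional re.IGNORECASE flag, as in the question's own snippet.

```python
import re

import pandas as pd


def search_rows(df, columns, pattern, sort_by, ignore_case=True):
    """Return the rows where any of `columns` matches `pattern`, sorted by `sort_by`.

    Hypothetical helper; assumes the searched columns contain strings.
    """
    flags = re.IGNORECASE if ignore_case else 0
    mask = df[columns].apply(
        # na=False treats missing values as "no match" instead of NaN
        lambda col: col.str.contains(pattern, flags=flags, regex=True, na=False)
    ).any(axis=1)
    return df[mask].sort_values(sort_by)


df = pd.DataFrame(
    {
        'col1': ['vhigh', 'low', 'vlow'],
        'col2': ['eee', 'low', 'high'],
        'val': [100, 200, 300]
    }
)

result = search_rows(df, ['col1', 'col2'], r'vhigh|high', sort_by='val')
print(result)
```

To search three columns with different criteria, pass a different list of column names, pattern, or sort column; to get the non-matching rows instead, the mask inside the function would need to be inverted with ~.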
YS-L answered Sep 20 '22 15:09