I'd appreciate your help. I have a pandas dataframe. I want to search 3 columns of the dataframe using a regular expression, then return all rows that meet the search criteria, sorted by one of my columns. I would like to write this as a function so I can implement this logic with other criteria if possible, but am not quite sure how to do this.
For example, I know how pull the results of a search thusly (with col1 being a column name):
idx1 = df.col1.str.contains(r'vhigh|high', flags=re.IGNORECASE, regex=True, na=False)
print df[~idx1]
but I can't figure out how to take this type of action, and perform it with multiple columns and then sort. Anyone have any tips?
Import required modules. Assign data frame. Create pattern-mixer object with the data frame as a constructor argument. Call find() method of the pattern-mixer object to identify various patterns in the data frame.
You can use df[df["Courses"] == 'Spark'] to filter rows by a condition in pandas DataFrame. Not that this expression returns a new DataFrame with selected rows. You can also write the above statement with a variable.
In Pandas, DataFrame. loc[] property is used to get a specific cell value by row & label name(column name).
You can use apply
to make the code more concise. For example, given this DataFrame:
df = pd.DataFrame(
{
'col1': ['vhigh', 'low', 'vlow'],
'col2': ['eee', 'low', 'high'],
'val': [100,200,300]
}
)
print df
Input:
col1 col2 val
0 vhigh eee 100
1 low low 200
2 vlow high 300
You can select all the rows that contain the strings vhigh
or high
in columns col1
or col2
as follow:
mask = df[['col1', 'col2']].apply(
lambda x: x.str.contains(
'vhigh|high',
regex=True
)
).any(axis=1)
print df[mask]
The apply
function applies the contains
function on each column (since by default axis=0
). The any
function returns a Boolean mask, with element True indicating that at least one of the columns met the search criteria. This can then be used to perform selection on the original DataFrame.
Output:
col1 col2 val
0 vhigh eee 100
2 vlow high 300
Then, to sort the result by a column, e.g. the val
column, you could simply do:
df[mask].sort('val')
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With