I have a list:
things = ['A1','B2','C3']
I have a pandas data frame with a column containing values separated by a semicolon - some of the rows will contain matches with one of the items in the list above (it won't be a perfect match since it has other parts of a string in the column.. for example, a row in that column may have 'Wow;Here;This=A1;10001;0')
I want to save the rows that contain a match with items from the list, and then create a new data frame with those selected rows (should have the same headers). This is what I tried:
import re
for_new_df = []
for x in df['COLUMN']:
    for mp in things:
        if df[df['COLUMN'].str.contains(mp)]:
            for_new_df.append(mp) #This won't save the whole row - help here too, please.
This code gave me an error:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I'm very new to coding, so the more explanation and detail in your answer, the better! Thanks in advance.
Using “contains” to find a substring in a pandas DataFrame: the str.contains method searches a column for a specific substring. It returns a boolean Series that is True where the original value contains the substring and False where it does not.
Use the in operator for partial matches between two plain Python strings. x in y returns True if x is a substring of y, and False if it is not. The characters of x must appear in y consecutively; if they only appear scattered through y, False is returned.
The isin() function checks whether each value in a column is exactly equal to one of the items in a list. It does not look for substrings, so it will not select rows where the column holds a longer string that merely contains a list item.
The str.match() function determines whether each string in a Series matches a regular expression, starting from the beginning of the string.
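To make the difference between these four checks concrete, here is a small sketch. The Series s and its values are made up for illustration and are not taken from the question.

```python
import pandas as pd

# Hypothetical sample values, chosen to show where the methods disagree
s = pd.Series(['Wow;Here;This=A1;10001;0', 'A1', 'A1;tail', 'nothing'])

# str.contains: True when 'A1' appears anywhere in the value
contains_mask = s.str.contains('A1')

# str.match: True only when the pattern matches at the start of the value
match_mask = s.str.match('A1')

# isin: True only when the whole value equals an item in the list
isin_mask = s.isin(['A1'])

# in: substring test between two plain Python strings
in_result = 'A1' in 'Wow;Here;This=A1;10001;0'

print(contains_mask.tolist())
print(match_mask.tolist())
print(isin_mask.tolist())
print(in_result)
```

For the situation in the question, where the list item is buried in the middle of a longer string, only str.contains (or in, applied value by value) finds the match.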
You can avoid the loop by joining your list of words to create a regex and using str.contains:
pat = '|'.join(things)
for_new_df = df[df['COLUMN'].str.contains(pat)]
This should just work.
The regex pattern becomes 'A1|B2|C3', which matches any string that contains at least one of these substrings, anywhere in the string.
Example:
In [65]:
things = ['A1','B2','C3']
pat = '|'.join(things)
df = pd.DataFrame({'a':['Wow;Here;This=A1;10001;0', 'B2', 'asdasda', 'asdas']})
df[df['a'].str.contains(pat)]
Out[65]:
                          a
0  Wow;Here;This=A1;10001;0
1                        B2
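One caveat with the joined pattern above: it is treated as a regular expression, so list items that contain characters such as . or + would not be matched literally. A possible way around that, sketched here with made-up list items (they are not from the question), is to pass each item through re.escape before joining.

```python
import re
import pandas as pd

# Hypothetical items containing regex metacharacters ('.' and '+')
things = ['A1', 'B.2', 'C+3']

# Escape each item so it is matched literally, then join with '|'
pat = '|'.join(re.escape(t) for t in things)

df = pd.DataFrame({'a': ['x;A1;y', 'x;B.2;y', 'x;BX2;y', 'x;C+3;y']})

# 'x;BX2;y' is not selected, because the '.' in 'B.2' is now a literal dot
matches = df[df['a'].str.contains(pat)]
print(matches)
```

With plain items like 'A1', 'B2' and 'C3' the escaping changes nothing, so it should be safe to apply either way.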
As to why it failed:
if df[df['COLUMN'].str.contains(mp)]
The expression inside this line:
df[df['COLUMN'].str.contains(mp)]
returns a DataFrame masked by the boolean array from your inner str.contains. The if statement doesn't know how to evaluate a whole DataFrame as a single True or False, hence the error. If you think about it, what should it do if you have one True, or all but one True? It expects a scalar, not an array-like value.
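To illustrate the explanation above, the following sketch reproduces the error, shows how .any() reduces the boolean Series to a single value that if can handle, and then gives a loop-style version that keeps whole rows. The OTHER column and the sample data are assumptions added only so that the example has more than one header.

```python
import pandas as pd

things = ['A1', 'B2', 'C3']
# Sample data; the 'OTHER' column is hypothetical
df = pd.DataFrame({'COLUMN': ['Wow;Here;This=A1;10001;0', 'B2', 'asdasda', 'asdas'],
                   'OTHER': [1, 2, 3, 4]})

# Reproduce the problem: a DataFrame cannot be used as an if condition
try:
    if df[df['COLUMN'].str.contains('A1')]:
        pass
except ValueError as err:
    message = str(err)
    print(message)

# .any() collapses the boolean Series to one True/False value
has_a1 = df['COLUMN'].str.contains('A1').any()
print(has_a1)

# Loop-style alternative: one True/False per row, then select whole rows
keep = [any(mp in value for mp in things) for value in df['COLUMN']]
for_new_df = df[keep]
print(for_new_df)
```

The last version is closer to the original loop, but the joined regex with str.contains does the same job in one line.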
Pandas is actually amazing but I don't find it very easy to use. However it does have many functions designed to make life easy, including tools for searching through huge data frames.
Though it may not be a full solution to your problem, this may help set you off on the right foot. I have assumed that you know which column you are searching in, column A in my example.
import pandas as pd
df = pd.DataFrame({'A' : pd.Categorical(['Wow;Here;This=A1;10001;0', 'Another;C3;Row=Great;100', 'This;D6;Row=bad100']),
                   'B' : 'foo'})
print(df) # Original data frame
print()
print(df['A'].str.contains('A1|B2|C3')) # Boolean array showing matches for col A
print()
print(df[df['A'].str.contains('A1|B2|C3')]) # Matching rows
The output:
                          A    B
0  Wow;Here;This=A1;10001;0  foo
1  Another;C3;Row=Great;100  foo
2        This;D6;Row=bad100  foo

0     True
1     True
2    False
Name: A, dtype: bool

                          A    B
0  Wow;Here;This=A1;10001;0  foo
1  Another;C3;Row=Great;100  foo
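If the column you are searching may contain missing values, the boolean result can contain missing entries as well, and depending on the pandas version, using it to select rows may raise an error. A minimal sketch, assuming a column with one NaN entry (the data here is made up), is to pass na=False so that missing values count as non-matches.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in column A
df = pd.DataFrame({'A': ['Wow;Here;This=A1;10001;0', np.nan, 'This;D6;Row=bad100'],
                   'B': 'foo'})

# na=False treats missing values as "no match" instead of leaving them missing
mask = df['A'].str.contains('A1|B2|C3', na=False)

matches = df[mask]
print(matches)
```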