Search for a partial string match in a data frame column from a list - Pandas - Python

Tags:

pandas

I have a list:

things = ['A1','B2','C3']

I have a pandas data frame with a column containing values separated by semicolons. Some of the rows will contain one of the items in the list above, but not as an exact match, because the column holds other parts of a string as well. For example, a row in that column may have 'Wow;Here;This=A1;10001;0'.

I want to save the rows that contain a match with items from the list, and then create a new data frame with those selected rows (should have the same headers). This is what I tried:

import re

for_new_df =[]

for x in df['COLUMN']:
    for mp in things:
        if df[df['COLUMN'].str.contains(mp)]:
            for_new_df.append(mp)  #This won't save the whole row - help here too, please.

This code gave me an error:

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I'm very new to coding, so the more explanation and detail in your answer, the better! Thanks in advance.


asked Jul 12 '16 15:07

Eric Coy

2 Answers

You can avoid the loop by joining your list of words to create a regex and use str.contains:

pat = '|'.join(things)
for_new_df = df[df['COLUMN'].str.contains(pat)]

This should work as is.

The regex pattern becomes 'A1|B2|C3', which matches any string that contains at least one of these substrings.

Example:

In [65]:
import pandas as pd

things = ['A1','B2','C3']
pat = '|'.join(things)
df = pd.DataFrame({'a':['Wow;Here;This=A1;10001;0', 'B2', 'asdasda', 'asdas']})
df[df['a'].str.contains(pat)]

Out[65]:
                          a
0  Wow;Here;This=A1;10001;0
1                        B2
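One caveat with the joined pattern: str.contains treats it as a regular expression by default, so a search term containing a character such as '.' or '+' would not be matched literally. A minimal sketch of one way to guard against that, assuming the terms are meant as plain substrings (the term 'A.1' and the sample rows are made up for illustration):

```python
import re
import pandas as pd

# Hypothetical search terms; 'A.1' contains the regex metacharacter '.'
things = ['A.1', 'B2', 'C3']

# Escape each term so it is matched literally, then join with '|'
pat = '|'.join(re.escape(t) for t in things)

df = pd.DataFrame({'a': ['Wow;Here;This=A.1;10001;0', 'This=AX1', 'B2', 'asdas']})

# Without re.escape, the unescaped 'A.1' would also match 'AX1'
matches = df[df['a'].str.contains(pat)]
print(matches)
```

If the terms are intended as regular expressions, the escaping step should be skipped.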

As to why it failed:

if df[df['COLUMN'].str.contains(mp)]

this line:

df[df['COLUMN'].str.contains(mp)]

returns a DataFrame masked by the boolean array from the inner str.contains. An if statement doesn't know how to evaluate a whole DataFrame (or an array of booleans) as a single True or False, hence the error. If you think about it, what should it do if only one value is True, or all but one are True? It expects a scalar, not an array-like value.
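To make the point concrete, here is a small sketch, using a made-up three-row frame, of the failing test and two ways to reduce the result to a single boolean. Whether .any() or .empty is the better fit depends on what the if is meant to check.

```python
import pandas as pd

df = pd.DataFrame({'COLUMN': ['Wow;Here;This=A1;10001;0', 'B2', 'asdasda']})

mask = df['COLUMN'].str.contains('A1')   # boolean Series, one value per row
subset = df[mask]                        # DataFrame holding the matching rows

# Using the DataFrame itself as a condition raises the ValueError from the question
try:
    if subset:
        pass
    raised = False
except ValueError:
    raised = True

# Either of these yields a single True/False that an if statement can use
has_match = mask.any()          # True if at least one row matched
not_empty = not subset.empty    # True if the filtered frame has any rows

print(raised, has_match, not_empty)
```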

answered Oct 04 '22 13:10

EdChum

Pandas is actually amazing but I don't find it very easy to use. However it does have many functions designed to make life easy, including tools for searching through huge data frames.

Though it may not be a full solution to your problem, this may help set you off on the right foot. I have assumed that you know which column you are searching in, column A in my example.

import pandas as pd

df = pd.DataFrame({'A' : pd.Categorical(['Wow;Here;This=A1;10001;0', 'Another;C3;Row=Great;100', 'This;D6;Row=bad100']),
                   'B' : 'foo'})
print(df)  # Original data frame
print()
print(df['A'].str.contains('A1|B2|C3'))  # Boolean array showing matches for col A
print()
print(df[df['A'].str.contains('A1|B2|C3')])  # Matching rows

The output:

                          A    B
0  Wow;Here;This=A1;10001;0  foo
1  Another;C3;Row=Great;100  foo
2        This;D6;Row=bad100  foo

0     True
1     True
2    False
Name: A, dtype: bool

                          A    B
0  Wow;Here;This=A1;10001;0  foo
1  Another;C3;Row=Great;100  foo
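If the column can contain missing values, the boolean mask may contain missing entries too, and depending on the pandas version and column dtype that can cause an error when the mask is used for indexing. A hedged sketch, assuming missing rows should simply count as non-matches (the None row is added here for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': ['Wow;Here;This=A1;10001;0',
                         'Another;C3;Row=Great;100',
                         None,                      # missing value, for illustration
                         'This;D6;Row=bad100'],
                   'B': 'foo'})

# na=False tells str.contains to report False for missing values,
# so the mask is purely boolean and safe to index with
mask = df['A'].str.contains('A1|B2|C3', na=False)
result = df[mask]
print(result)
```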

answered Oct 04 '22 13:10

emmalg
