
Search for a partial string match in a data frame column from a list - Pandas - Python

Tags: python, pandas

I have a list:

things = ['A1','B2','C3']

I have a pandas data frame with a column containing values separated by semicolons. Some of the rows will contain a match with one of the items in the list above. It won't be an exact match, because the column holds other parts of a string as well; for example, a row in that column may have 'Wow;Here;This=A1;10001;0'.

I want to save the rows that contain a match with items from the list, and then create a new data frame with those selected rows (should have the same headers). This is what I tried:

import re

for_new_df =[]

for x in df['COLUMN']:
    for mp in things:
        if df[df['COLUMN'].str.contains(mp)]:
            for_new_df.append(mp)  #This won't save the whole row - help here too, please.

This code gave me an error:

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I'm very new to coding, so the more explanation and detail in your answer, the better! Thanks in advance.

Eric Coy asked Jul 12 '16 15:07



2 Answers

You can avoid the loop by joining your list of words to create a regex and use str.contains:

pat = '|'.join(things)
for_new_df = df[df['COLUMN'].str.contains(pat)]

This should just work.

The regex pattern becomes 'A1|B2|C3', which matches any string that contains at least one of these items anywhere in it.

Example:

In [65]:
things = ['A1','B2','C3']
pat = '|'.join(things)
df = pd.DataFrame({'a':['Wow;Here;This=A1;10001;0', 'B2', 'asdasda', 'asdas']})
df[df['a'].str.contains(pat)]

Out[65]:
                          a
0  Wow;Here;This=A1;10001;0
1                        B2
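
One caveat with joining the list into a regex: if an item contains a regex metacharacter (such as '.' or '+'), it will be interpreted as a pattern rather than literal text, and rows with missing values produce NaN in the mask. A minimal sketch of one way to guard against both, assuming a made-up item 'C.3' and a made-up None row that are not in the original question:

```python
import re
import pandas as pd

# 'C.3' is a hypothetical item, added only to show a regex metacharacter
things = ['A1', 'B2', 'C.3']

# The last three rows are made up for illustration
df = pd.DataFrame({'a': ['Wow;Here;This=A1;10001;0', 'B2', 'CX3', 'C.3;x', None]})

# Escape each item so it is matched literally, then join with '|' (regex OR)
pat = '|'.join(re.escape(t) for t in things)

# na=False treats missing values as non-matches instead of leaving NaN in the mask
mask = df['a'].str.contains(pat, na=False)
matches = df[mask]
print(matches)
```

Without re.escape, the unescaped '.' in 'C.3' would also match 'CX3'.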

As to why it failed:

if df[df['COLUMN'].str.contains(mp)]

this line:

df[df['COLUMN'].str.contains(mp)]

returns a df masked by the boolean array from your inner str.contains. The if statement doesn't know how to evaluate a whole data frame (or an array of booleans) as a single condition, hence the error. If you think about it, what should it do if only one value is True, or all but one are True? It expects a scalar, not an array-like value.
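
To make the point above concrete, here is a small sketch that reproduces the error and shows two ways to reduce the result to a single True/False. The two-row data frame is a stand-in for the asker's data, not the real thing:

```python
import pandas as pd

# Stand-in data; the real 'COLUMN' contents are not shown in the question
df = pd.DataFrame({'COLUMN': ['Wow;Here;This=A1;10001;0', 'asdasda']})

# Boolean indexing returns a DataFrame, not a single True/False
subset = df[df['COLUMN'].str.contains('A1')]

error_message = None
try:
    if subset:  # raises ValueError: truth value of a DataFrame is ambiguous
        pass
except ValueError as err:
    error_message = str(err)

# Two ways to get a scalar that `if` can evaluate
has_match = not subset.empty                       # did any row survive the mask?
any_match = df['COLUMN'].str.contains('A1').any()  # is any mask entry True?

print(error_message)
print(has_match, any_match)
```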

EdChum answered Oct 04 '22 13:10


Pandas is actually amazing but I don't find it very easy to use. However it does have many functions designed to make life easy, including tools for searching through huge data frames.

Though it may not be a full solution to your problem, this may help set you off on the right foot. I have assumed that you know which column you are searching in, column A in my example.

import pandas as pd

df = pd.DataFrame({'A' : pd.Categorical(['Wow;Here;This=A1;10001;0', 'Another;C3;Row=Great;100', 'This;D6;Row=bad100']),
                   'B' : 'foo'})
print(df)  # Original data frame
print()
print(df['A'].str.contains('A1|B2|C3'))  # Boolean Series showing matches for col A
print()
print(df[df['A'].str.contains('A1|B2|C3')])  # Matching rows

The output:

                          A    B
0  Wow;Here;This=A1;10001;0  foo
1  Another;C3;Row=Great;100  foo
2        This;D6;Row=bad100  foo

0     True
1     True
2    False
Name: A, dtype: bool

                          A    B
0  Wow;Here;This=A1;10001;0  foo
1  Another;C3;Row=Great;100  foo
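
The question also asked for a new data frame with the same headers. Boolean indexing already keeps every column, so a sketch along these lines may be enough; the .copy() and reset_index calls are optional choices of mine, not something the original answers used:

```python
import pandas as pd

df = pd.DataFrame({'A': ['Wow;Here;This=A1;10001;0',
                         'Another;C3;Row=Great;100',
                         'This;D6;Row=bad100'],
                   'B': 'foo'})

# .copy() makes the result independent of df, so later edits don't touch the original
new_df = df[df['A'].str.contains('A1|B2|C3')].copy()

# Optional: renumber the rows from 0 instead of keeping the original index labels
new_df = new_df.reset_index(drop=True)

print(new_df.columns.tolist())
print(new_df)
```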

emmalg answered Oct 04 '22 13:10