Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Check if String in List of Strings is in Pandas DataFrame Column

I am new to Python and have a question regarding matching strings in a list to a column in a df.

When I run the following commands, I would like a new column named "Match" to be created, and if there is a match between the string in the list and the string in the column the value in the "Match" column and corresponding row should be True, if no match, then False. The desired outcome would be False, False, True, False, False. Since the string "Honda" is not an exact match of "Honda Civic" it should not be True. Same with "Toy" is not the exact match of "Toyota Corolla".

Creating df:

Cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4', np.nan],
    'Price': [22000,25000,27000,35000, 29000],
    'Liscence Plate': ['ABC 123', 'XYZ 789', 'CBA 321', 'ZYX 987', 'DEF 456']}

df = DataFrame(Cars,columns= ['Brand', 'Price', 'Liscence Plate'])

I then create a list of the values I would like to search for, joined by |.

search_for_these_values = ['Honda', 'Toy', 'Ford Focus', 'Audi A4 2019']
pattern = '|'.join(search_for_these_values)

Here I have tried the str.match command and are given True, True, True, False, False.

df['Match'] = df["Brand"].str.match(pattern, na=False)

Here I have created a loop using the == operator and are given False, False, False, False, False.

for i in range(0,len(pattern)):
    df['Match'] = df['Brand'] == pattern[i]

Thank you for the help!

like image 387
BKlassen Avatar asked May 16 '19 13:05

BKlassen


1 Answers

If need match values in list, use Series.isin:

df['Match'] = df["Brand"].isin(search_for_these_values)
print (df)
            Brand  Price Liscence Plate  Match
0     Honda Civic  22000        ABC 123  False
1  Toyota Corolla  25000        XYZ 789  False
2      Ford Focus  27000        CBA 321   True
3         Audi A4  35000        ZYX 987  False
4             NaN  29000        DEF 456  False

Solution with match is used for check substrings, so different output.

Alternative solution for match substrings with Series.str.contains and parameter na=False:

df['Match'] = df["Brand"].str.contains(pattern, na=False)
print (df)
            Brand  Price Liscence Plate  Match
0     Honda Civic  22000        ABC 123   True
1  Toyota Corolla  25000        XYZ 789   True
2      Ford Focus  27000        CBA 321   True
3         Audi A4  35000        ZYX 987  False
4             NaN  29000        DEF 456  False

EDIT:

For test values in substrings is possible use list comprehension with loop by values in search_for_these_values and test match by in with any for return at least one True:

df['Match'] = [any(x in z for z in search_for_these_values) 
                                if x == x 
                                else False 
                                for x in df["Brand"]]
print (df)

            Brand  Price Liscence Plate  Match
0     Honda Civic  22000        ABC 123  False
1  Toyota Corolla  25000        XYZ 789  False
2      Ford Focus  27000        CBA 321   True
3         Audi A4  35000        ZYX 987   True
4             NaN  29000        DEF 456  False
like image 99
jezrael Avatar answered Nov 04 '22 10:11

jezrael