Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Conditional data selection with text string data in pandas dataframe

I've looked but seem to be coming up dry for an answer to the following question.

I have a pandas dataframe analogous to this (call it 'df'):

        Type              Set
    1   theGreen          Z
    2   andGreen          Z           
    3   yellowRed         X
    4   roadRed           Y

I want to add another column to the dataframe (or generate a series) of the same length as the dataframe (= equal number of records/rows) which assigns a numerical coding variable (1) if the Type contains the string "Green", (0) otherwise.

Essentially, I'm trying to find a way of doing this:

   df['color'] = np.where(df['Type'] == 'Green', 1, 0)

Except instead of the usual numpy operators (<,>,==,!=, etc.) I need a way of saying "in" or "contains". Is this possible? Any and all help appreciated!

like image 646
Cole Robertson Avatar asked Nov 15 '16 15:11

Cole Robertson


1 Answers

Use str.contains:

df['color'] = np.where(df['Type'].str.contains('Green'), 1, 0)
print (df)
        Type Set  color
1   theGreen   Z      1
2   andGreen   Z      1
3  yellowRed   X      0
4    roadRed   Y      0

Another solution with apply:

df['color'] = np.where(df['Type'].apply(lambda x: 'Green' in x), 1, 0)
print (df)
        Type Set  color
1   theGreen   Z      1
2   andGreen   Z      1
3  yellowRed   X      0
4    roadRed   Y      0

Second solution is faster, but doesn't work with NaN in column Type, then return error:

TypeError: argument of type 'float' is not iterable

Timings:

#[400000 rows x 4 columns]
df = pd.concat([df]*100000).reset_index(drop=True)  

In [276]: %timeit df['color'] = np.where(df['Type'].apply(lambda x: 'Green' in x), 1, 0)
10 loops, best of 3: 94.1 ms per loop

In [277]: %timeit df['color1'] = np.where(df['Type'].str.contains('Green'), 1, 0)
1 loop, best of 3: 256 ms per loop
like image 51
jezrael Avatar answered Sep 29 '22 07:09

jezrael