I've looked but seem to be coming up dry for an answer to the following question.
I have a pandas dataframe analogous to this (call it 'df'):
Type Set
1 theGreen Z
2 andGreen Z
3 yellowRed X
4 roadRed Y
I want to add another column to the dataframe (or generate a series) of the same length as the dataframe (= equal number of records/rows) which assigns a numerical coding variable (1) if the Type contains the string "Green", (0) otherwise.
Essentially, I'm trying to find a way of doing this:
df['color'] = np.where(df['Type'] == 'Green', 1, 0)
Except instead of the usual numpy operators (<,>,==,!=, etc.) I need a way of saying "in" or "contains". Is this possible? Any and all help appreciated!
Use str.contains
:
df['color'] = np.where(df['Type'].str.contains('Green'), 1, 0)
print (df)
Type Set color
1 theGreen Z 1
2 andGreen Z 1
3 yellowRed X 0
4 roadRed Y 0
Another solution with apply
:
df['color'] = np.where(df['Type'].apply(lambda x: 'Green' in x), 1, 0)
print (df)
Type Set color
1 theGreen Z 1
2 andGreen Z 1
3 yellowRed X 0
4 roadRed Y 0
Second solution is faster, but doesn't work with NaN
in column Type
, then return error
:
TypeError: argument of type 'float' is not iterable
Timings:
#[400000 rows x 4 columns]
df = pd.concat([df]*100000).reset_index(drop=True)
In [276]: %timeit df['color'] = np.where(df['Type'].apply(lambda x: 'Green' in x), 1, 0)
10 loops, best of 3: 94.1 ms per loop
In [277]: %timeit df['color1'] = np.where(df['Type'].str.contains('Green'), 1, 0)
1 loop, best of 3: 256 ms per loop
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With