
How to use \b word boundary in pandas str.contains?

Is there an equivalent when using str.contains?

The following code mistakenly puts "Said Business School" in the category because of 'Sa'. If I could add a word boundary, it would solve the problem. Putting a space after 'Sa' messes this up. I am using pandas (df is a DataFrame). I know I can use a regex, but I'm curious whether I can do this with plain strings to make it faster.

gprivate_n = 'Co|Inc|Llc|Group|Ltd|Corp|Plc|Sa |Insurance|Ag|As|Media|&|Corporation'
df.loc[df.Name.str.contains(gprivate_n), "Private"] = 1
user3314418 asked Mar 12 '14 17:03



2 Answers

This is the usual Python regex pitfall: '\b' must be passed either in a raw string (r'\b...') or, less desirably, double-escaped ('\\b'). In a normal string literal, '\b' is a backspace character, not a word boundary.
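To make the difference concrete, here is a small sketch using only the standard re module. The sample string 'Acme Sa' is made up for illustration. Without the raw-string prefix, the pattern contains backspace characters and never matches as a word boundary:

```python
import re

plain = '\bSa\b'      # '\b' is a backspace character (\x08) in a normal literal
raw = r'\bSa\b'       # the backslash and 'b' reach the regex engine intact
escaped = '\\bSa\\b'  # double-escaping builds the same pattern as the raw string

text = 'Acme Sa'      # hypothetical sample name

# The plain pattern looks for literal backspaces around 'Sa' and finds none.
plain_hit = re.search(plain, text) is not None
# The raw pattern matches 'Sa' as a whole word.
raw_hit = re.search(raw, text) is not None

print(plain_hit, raw_hit, escaped == raw)
```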

So your regex should be:

gprivate_n = r'\b(?:Co|Inc|Llc|Group|Ltd|Corp|Plc|Sa|Insurance|Ag|As|Media|Corporation)\b|&'
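As a rough check of how a word-boundary pattern behaves with str.contains, here is a minimal sketch. The DataFrame is hypothetical; only "Said Business School" comes from the question. The pattern used here is one possible form and rests on a few assumptions: it closes with a second \b so that 'Sa' cannot match inside 'Said', it uses a non-capturing group (?:...) because pandas warns when str.contains receives a pattern with capturing groups, and it keeps '&' outside the \b...\b part because '&' is not a word character.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame(
    {"Name": ["Said Business School", "Acme Sa", "Widget Co", "Smith & Sons"]}
)

# Whole-word alternatives wrapped in \b...\b; '&' is matched separately.
pattern = (
    r"\b(?:Co|Inc|Llc|Group|Ltd|Corp|Plc|Sa|Insurance|Ag|As|Media|Corporation)\b"
    r"|&"
)

# na=False treats missing names as non-matches instead of propagating NaN.
mask = df["Name"].str.contains(pattern, regex=True, na=False)
df["Private"] = mask.astype(int)

print(df)
```

With this data, "Said Business School" is left at 0 and the other three rows are set to 1.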
smci answered Oct 19 '22 21:10


A word boundary is not a character, so you can't find it with a plain substring search. You need to either use a regex (which str.contains does by default) or split each string into words and then check whether any of those words is in the set you currently have defined in gprivate_n.
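The split-and-check approach could look roughly like the sketch below. The DataFrame and the helper name has_private_word are hypothetical, and the code assumes that names contain no missing values and that words are separated by whitespace, so a token with attached punctuation such as 'Inc.' would not match.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame(
    {"Name": ["Said Business School", "Acme Sa", "Widget Co", "Smith & Sons"]}
)

# The alternatives from gprivate_n, stored as a set of whole words.
private_words = {
    "Co", "Inc", "Llc", "Group", "Ltd", "Corp", "Plc", "Sa",
    "Insurance", "Ag", "As", "Media", "&", "Corporation",
}

def has_private_word(name):
    # str.split() with no argument splits on runs of whitespace.
    # isdisjoint() is True when the two collections share no element.
    return not private_words.isdisjoint(name.split())

mask = df["Name"].apply(has_private_word)
df["Private"] = mask.astype(int)

print(df)
```

Whether this is faster than the regex depends on the data, so it is worth timing both on a realistic sample.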

RexE answered Oct 19 '22 20:10