Capturing emoticons using regular expression in python

Tags:

python

regex

I would like a regex pattern that matches the smileys ":)" and ":(". It should also match repeated smileys like ":) :)" and ":) :(", but reject invalid syntax like ":( (".

This is what I have so far, but it also matches ":( (":

bool( re.match("(:\()",str) )

I may be missing something obvious here, and I'd like some help with this seemingly simple task.


asked Jan 28 '13 20:01

coding_pleasures

1 Answer

I think it finally "clicked" exactly what you're asking about here. Take a look at the below:

import re

# Raw string, so the backslashes reach the regex engine unchanged.
smiley_pattern = r'^(:\(|:\))+$' # one or more of ":)" / ":(" and nothing else

def test_match(s):
    # print() as a function, so the script is consistent and runs on Python 3.
    print('Value: %s; Result: %s' % (
        s,
        'Matches!' if re.match(smiley_pattern, s) else 'Doesn\'t match.'
    ))

should_match = [
    ':)',   # Single smile
    ':(',   # Single frown
    ':):)', # Two smiles
    ':(:(', # Two frowns
    ':):(', # Mix of a smile and a frown
]
should_not_match = [
    '',         # Empty string
    ':(foo',    # Extraneous characters appended
    'foo:(',    # Extraneous characters prepended
    ':( :(',    # Space between frowns
    ':( (',     # Extraneous characters and space appended
    ':(('       # Extraneous duplicate of final character appended
]

print('The following should all match:')
for x in should_match:
    test_match(x)

print('')   # Newline for output clarity

print('The following should all not match:')
for x in should_not_match:
    test_match(x)

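The problem with your original code is that the regex (:\() only describes a single frown, and re.match only checks that the string starts with it. Anything after the first ":(" is ignored, which is why ":( (" matches. Let's break it down.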

The outside parentheses are a "grouping". They're what you'd reference if you were going to do a string replacement, and are used to apply regex operators on groups of characters at once. So, you're really saying:

( begin a group
- :\( ... do regex stuff ...
) end the group

The : isn't a regex reserved character, so it's just a colon. The \ is reserved, and it means "the following character is literal, not a regex operator". This is called an "escape sequence". Fully parsed into English, your regex says:

( begin a group
- : a colon character
- \( a left parenthesis character
) end the group
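
To see why that pattern accepts ":( (", here is a minimal sketch. It uses the question's regex written as a raw string, and the variable names are only for illustration:

```python
import re

original = r"(:\()"   # the question's pattern, written as a raw string

# re.match anchors only at the start of the string, so whatever
# follows the first ":(" is never looked at.
m = re.match(original, ":( (")
print(m.group(0))     # the part of the string that actually matched

# The pattern has no alternative for a smile, so ":)" is rejected.
print(re.match(original, ":)"))
```

The first call succeeds because the string begins with ":(", and the second returns None because the pattern never mentions ")".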

The regex I used is slightly more complex, but not bad. Let's break it down: ^(:\(|:\))+$.

^ and $ mean "the beginning of the line" and "the end of the line" respectively. Now we have ...

^ beginning of line
- (:\(|:\))+ ... do regex stuff ...
$ end of line

... so it only matches things that comprise the entire line, not simply occur in the middle of the string.
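
As an alternative to writing ^ and $ by hand, Python 3.4 and later also offer re.fullmatch, which requires the whole string to match. The sketch below assumes that version, and the helper name is_smiley_run is made up for illustration. It also shows one small difference: $ is allowed to match just before a trailing newline, while re.fullmatch is not.

```python
import re

smiley = r"(:\(|:\))+"   # the answer's group, without ^ and $

def is_smiley_run(s):
    """Return True only if the whole string is one or more smileys."""
    return re.fullmatch(smiley, s) is not None

print(is_smiley_run(":):("))
print(is_smiley_run(":( ("))

# `$` also matches just before a newline at the very end of the string,
# so the anchored pattern accepts ":)\n" while fullmatch does not.
anchored_accepts_newline = bool(re.match(r"^(:\(|:\))+$", ":)\n"))
print(anchored_accepts_newline, is_smiley_run(":)\n"))
```

Which behaviour is preferable depends on whether the input may carry a trailing newline, for example when reading lines from a file.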

We know that ( and ) denote a grouping. + means "one or more of these". Now we have:

^ beginning of line
( start a group
- :\(|:\) ... do regex stuff ...
) end the group
+ match one or more of this
$ end of line
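
One side effect of putting + on a group is worth knowing if you want the individual smileys rather than a yes/no answer: a repeated group remembers only its last repetition. A small sketch, assuming you would pull the smileys out with re.findall and an unanchored pattern:

```python
import re

m = re.match(r"^(:\(|:\))+$", ":):(:)")
print(m.group(0))   # the whole match
print(m.group(1))   # only the last repetition of the group

# To list every smiley, scan with the unanchored alternation instead.
found = re.findall(r":\(|:\)", ":):(:)")
print(found)
```

So the anchored pattern is a good fit for validating a string, and re.findall is the simpler tool for extracting each emoticon from it.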

Finally, there's the | (pipe) operator. It means "or". So, applying what we know from above about escaping characters, we're ready to complete the translation:

^ beginning of line
( start a group
- : a colon character
- \( a left parenthesis character
| or
- : a colon character
- \) a right parenthesis character
) end the group
+ match one or more of this
$ end of line
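
The question's examples (":) :)" and ":) :(") have a space between the smileys, which the pattern above rejects. If those spaces are meant to be allowed, one possible variant is sketched below. This is an assumption about the intended behaviour, not part of the pattern explained above, and the names SPACED_SMILEYS and is_smiley_sequence are hypothetical. It also uses the character class [()] as a shorter way to write "either parenthesis".

```python
import re

# Hypothetical variant: one smiley, followed by zero or more smileys,
# each optionally preceded by whitespace. Inside [...] the parentheses
# are literal characters, so no backslashes are needed.
SPACED_SMILEYS = re.compile(r":[()](?:\s*:[()])*")

def is_smiley_sequence(s):
    """Return True if s consists only of smileys, optionally space-separated."""
    return SPACED_SMILEYS.fullmatch(s) is not None

for candidate in [":) :)", ":) :(", ":):(", ":( (", ":(("]:
    print(repr(candidate), is_smiley_sequence(candidate))
```

The (?:...) group is non-capturing: it is there only so that * can repeat "optional whitespace plus a smiley" as a unit.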

I hope this helps. If not, let me know and I'll be happy to edit my answer with a reply.


answered Sep 28 '22 04:09

Lyndsy Simon
