
Capturing emoticons using regular expression in python

Tags:

python

regex

I would like a regex pattern that matches the smileys ":)" and ":(". It should also match repeated smileys like ":) :)" and ":) :(", but reject invalid syntax like ":( (".

This is what I have so far, but it also matches ":( (":

bool( re.match("(:\()",str) ) 

I may be missing something obvious here, and I'd like some help with this seemingly simple task.

asked Jan 28 '13 20:01 by coding_pleasures




1 Answer

I think what you're asking finally "clicked" for me. Take a look at the code below:

import re

smiley_pattern = r'^(:\(|:\))+$'  # raw string; matches one or more of ":)" and ":(" and nothing else

def test_match(s):
    # print() call form so the script runs on Python 3
    print('Value: %s; Result: %s' % (
        s,
        'Matches!' if re.match(smiley_pattern, s) else "Doesn't match."
    ))

should_match = [
    ':)',   # Single smile
    ':(',   # Single frown
    ':):)', # Two smiles
    ':(:(', # Two frowns
    ':):(', # Mix of a smile and a frown
]
should_not_match = [
    '',         # Empty string
    ':(foo',    # Extraneous characters appended
    'foo:(',    # Extraneous characters prepended
    ':( :(',    # Space between frowns
    ':( (',     # Extraneous characters and space appended
    ':(('       # Extraneous duplicate of final character appended
]

print('The following should all match:')
for x in should_match:
    test_match(x)

print('')   # Newline for output clarity

print('The following should all not match:')
for x in should_not_match:
    test_match(x)

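The problem with your original code is the regex itself, (:\(): it describes only a single frown, and re.match checks only the start of the string, so anything after that frown is ignored. Let's break it down.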

The outer parentheses form a "group". Groups are what you'd reference if you were doing a string replacement, and they let you apply a regex operator to several characters at once. So, you're really saying:

  • ( begin a group
    • :\( ... do regex stuff ...
  • ) end the group

The : isn't a special regex character, so it's just a colon. The \ is special: it means "treat the following character literally, not as a regex operator". This is called an "escape sequence". Fully parsed into English, your regex says:

  • ( begin a group
    • : a colon character
    • \( a left parenthesis character
  • ) end the group
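
To see why this pattern accepts ":( (", it may help to run it against a few strings. The following is a minimal sketch, assuming Python 3; the name original_pattern is only chosen for this illustration. Because re.match anchors at the start of the string and nowhere else, whatever follows the first frown is never examined.

```python
import re

# The pattern from the question, written as a raw string.
original_pattern = r'(:\()'

# re.match only requires the pattern to match at the START of the string;
# any trailing text after the first ":(" is simply ignored.
for s in [':(', ':( (', ':(foo', 'foo:(']:
    m = re.match(original_pattern, s)
    print(repr(s), '->', repr(m.group(0)) if m else 'no match')
```

Only the last string fails, because it does not begin with a frown.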

The regex I used is slightly more complex, but not bad. Let's break it down: ^(:\(|:\))+$.

^ and $ mean "the beginning of the line" and "the end of the line" respectively (without the re.MULTILINE flag, Python treats these as the beginning and end of the whole string). Now we have ...

  • ^ beginning of line
    • (:\(|:\))+ ... do regex stuff ...
  • $ end of line

... so it only matches when the smileys make up the entire line, not when they merely occur somewhere in the string.
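
The effect of the anchors can be checked directly. This is a small sketch, assuming Python 3.4 or later (for re.fullmatch); the names unanchored and anchored are only for illustration. It also shows one subtlety: $ still matches just before a single trailing newline, whereas re.fullmatch requires the pattern to consume the whole string.

```python
import re

unanchored = r'(:\(|:\))+'    # same pattern without ^ and $
anchored = r'^(:\(|:\))+$'    # the pattern used in this answer

s = ':( ('
print(bool(re.match(unanchored, s)))      # True: only the leading ":(" is checked
print(bool(re.match(anchored, s)))        # False: " (" is left over before the end
print(bool(re.fullmatch(unanchored, s)))  # False: the whole string must match

# $ also matches just before a trailing newline; fullmatch does not allow that.
print(bool(re.match(anchored, ':)\n')))       # True
print(bool(re.fullmatch(unanchored, ':)\n'))) # False
```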

We know that ( and ) denote a grouping. + means "one or more of these". Now we have:

  • ^ beginning of line
  • ( start a group
    • :\(|:\) ... do regex stuff ...
  • ) end the group
  • + match one or more of this
  • $ end of line

Finally, there's the | (pipe) operator. It means "or". So, applying what we know from above about escaping characters, we're ready to complete the translation:

  • ^ beginning of line
  • ( start a group
    • : a colon character
    • \( a left parenthesis character
  • | or
    • : a colon character
    • \) a right parenthesis character
  • ) end the group
  • + match one or more of this
  • $ end of line
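
The question also mentions smileys separated by spaces, such as ":) :)", which the pattern above deliberately rejects. If that form should be accepted, one possible variation is sketched below. It is not part of the pattern explained above: the name spaced_pattern is hypothetical, the character class [()] is shorthand for "either parenthesis", and the choice to allow any amount of whitespace between smileys is an assumption about what the asker wants.

```python
import re

# Hypothetical variant: smileys may be separated by optional whitespace.
# re.VERBOSE lets the pattern carry its own line-by-line explanation.
spaced_pattern = re.compile(r'''
    ^                   # start of string
    :[()]               # first smiley: a colon, then either parenthesis
    (?: \s* :[()] )*    # further smileys, each optionally preceded by whitespace
    $                   # end of string
''', re.VERBOSE)

for s in [':)', ':):(', ':) :)', ':) :(', ':( (', ':((', '']:
    print(repr(s), '->', 'Matches!' if spaced_pattern.match(s) else "Doesn't match.")
```

With this variant the first four strings match and the last three do not, since a stray parenthesis is never preceded by its own colon.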

I hope this helps. If not, let me know and I'll be happy to edit my answer with a reply.

answered Sep 28 '22 04:09 by Lyndsy Simon