Regex to match 'lol' to 'lolllll' and 'omg' to 'omggg', etc

Question

Hey there, I love regular expressions, but I'm just not good at them at all.

I have a list of some 400 shortened words such as lol, omg, lmao, etc. Whenever someone types one of these shortened words, it is replaced with its English counterpart ([laughter], or something to that effect). Anyway, people are annoying and type these shorthand words with the last letter(s) repeated any number of times.

examples: omg -> omgggg, lol -> lollll, haha -> hahahaha, lol -> lololol

I was wondering if anyone could hand me the regex (in Python, preferably) to deal with this?

Thanks all.

(It's a Twitter-related project for topic identification, if anyone's curious. If someone tweets "Let's go shoot some hoops", how do you know the tweet is about basketball, etc.?)

Srikar Appalaraju · Accepted Answer

FIRST APPROACH -

Well, using regular expressions you could do something like this -

import re
# Collapse every run of the repeated letter down to a single letter.
re.sub(r'g+', 'g', 'omgggg')  # 'omg'
re.sub(r'l+', 'l', 'lollll')  # 'lol'

etc.

Let me point out that regular expressions are a fragile and basic way to deal with this problem. Users can easily type strings that break the patterns above; for example, 'g+' collapses every run of g, so it would also turn 'egg' into 'eg'. What I am trying to say is that this approach requires a lot of maintenance: you have to observe the patterns of mistakes users make and then create case-specific regular expressions for them.
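One way to cut down on the per-letter patterns is to generate a pattern from each abbreviation. The following is only a sketch, not a tested solution for the full list: the ABBREVIATIONS sample and the helper names build_pattern and canonical are made up for illustration. It assumes the stretched form is always the abbreviation followed by repeats of one of its own endings (g for omg, ol for lol, ha for haha).

```python
import re

# Hypothetical sample; the real list has around 400 entries.
ABBREVIATIONS = ["lol", "omg", "lmao", "haha"]

def build_pattern(word):
    """Match `word` followed by optional repeats of any of its own endings.

    For 'lol' this yields lol(?:(?:l)+|(?:ol)+|(?:lol)+)?, which accepts
    'lol', 'lollll' and 'lololol'.
    """
    tails = "|".join(
        "(?:%s)+" % re.escape(word[-k:]) for k in range(1, len(word) + 1)
    )
    return re.compile(re.escape(word) + "(?:%s)?" % tails, re.IGNORECASE)

PATTERNS = [(word, build_pattern(word)) for word in ABBREVIATIONS]

def canonical(token):
    """Return the base abbreviation for a stretched token, or None."""
    for word, pattern in PATTERNS:
        if pattern.fullmatch(token):
            return word
    return None

print(canonical("omgggg"))    # omg
print(canonical("lololol"))   # lol
print(canonical("hahahaha"))  # haha
print(canonical("hoops"))     # None
```

This still misses stretching in the middle of a word (for example 'loool'), so the maintenance concern above does not go away completely.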

SECOND APPROACH -

Instead, have you considered using the difflib module? It's a module with helpers for computing deltas between objects. Of particular importance here is SequenceMatcher. To paraphrase the official documentation -

SequenceMatcher is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. SequenceMatcher tries to compute a "human-friendly diff" between two sequences. The fundamental notion is the longest contiguous & junk-free matching subsequence.

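import difflib as dl

# Treat spaces as junk. ratio() can depend on argument order,
# so compare in both directions and average the two scores.
x   = dl.SequenceMatcher(lambda ch: ch == ' ', "omg", "omgggg")
y   = dl.SequenceMatcher(lambda ch: ch == ' ', "omgggg", "omg")
avg = (x.ratio() + y.ratio()) / 2.0
if avg >= 0.6:
    print('Match!')
else:
    print('Sorry!')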

According to the documentation, a ratio() over 0.6 means the sequences are close matches. You might need to tune this threshold for your data. If you need stricter matching, I found that a threshold of 0.8 works well.
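To check a token against the whole list rather than one pair, difflib also provides get_close_matches, which takes the same kind of cutoff. A minimal sketch, assuming a small made-up sample of the list and a hypothetical helper name closest:

```python
import difflib

# Hypothetical sample; the real list has around 400 entries.
ABBREVIATIONS = ["lol", "omg", "lmao"]

def closest(token, cutoff=0.6):
    """Return the best-matching abbreviation for `token`, or None."""
    matches = difflib.get_close_matches(token, ABBREVIATIONS, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(closest("omgggg"))      # omg
print(closest("basketball"))  # None
```

Keep in mind that ratio() is 2*M/T, where M is the number of matching characters and T is the combined length of both strings. For "omg" against "omgggg" that is 2*3/9, about 0.67, but the score drops as more letters are repeated: "omggggggg" scores 2*3/12 = 0.5 and falls below the 0.6 cutoff.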

Tags:

python

string-matching

regex

apexdodge
