I'm trying to detect whether a letter appears twice in a row in a string using RegEx (or maybe there's a better way?). For example, my string is:
ugknbfddgicrmopn
The output would be:
dd
However, I've tried something like:
re.findall('[a-z]{2}', 'ugknbfddgicrmopn')
but in this case, it returns:
['ug', 'kn', 'bf', 'dd', 'gi', 'cr', 'mo', 'pn'] # the expected output is `['dd']`
I also have a way to get the expected output:
>>> l = []
>>> tmp = None
>>> for i in 'ugknbfddgicrmopn':
...     if tmp != i:
...         tmp = i
...         continue
...     l.append(i*2)
...
>>> l
['dd']
But that's too complex...
If it's 'abbbcppq', then only catch:
abbbcppq
 ^^  ^^
So the output is:
['bb', 'pp']
Then, if it's 'abbbbcppq', catch bb twice:
abbbbcppq
 ^^^^ ^^
So the output is:
['bb', 'bb', 'pp']
You need to use a regex based on a capturing group, and define the regex as a raw string.
>>> import re
>>> re.search(r'([a-z])\1', 'ugknbfddgicrmopn').group()
'dd'
>>> [i+i for i in re.findall(r'([a-z])\1', 'abbbbcppq')]
['bb', 'bb', 'pp']
or
>>> [i[0] for i in re.findall(r'(([a-z])\2)', 'abbbbcppq')]
['bb', 'bb', 'pp']
Note that re.findall here returns a list of tuples, with the characters matched by the first group as the first element and those matched by the second group as the second element. In our case the characters in the first group are enough, so I used i[0].
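To make the shape of those results concrete, here is a small sketch that prints what re.findall returns for each of the two patterns before any post-processing. The variable names are only illustrative.

```python
import re

s = 'abbbbcppq'

# One capturing group: findall returns only that group's text for each match.
single = re.findall(r'([a-z])\1', s)

# Two capturing groups: findall returns one (group 1, group 2) tuple per match.
double = re.findall(r'(([a-z])\2)', s)

print(single)  # ['b', 'b', 'p']
print(double)  # [('bb', 'b'), ('bb', 'b'), ('pp', 'p')]
```

This is why the first version needs i+i to rebuild the pair, while the second can simply take i[0].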
As a Pythonic alternative, you can use the zip function within a list comprehension:
>>> s = 'abbbcppq'
>>>
>>> [i+j for i,j in zip(s,s[1:]) if i==j]
['bb', 'bb', 'pp']
If you are dealing with a large string, you can use the iter() function to convert the string to an iterator and itertools.tee() to create two independent iterators. Then call next() on the second iterator to consume its first item, and pass both iterators to zip (in Python 2.X use itertools.izip(), which returns an iterator).
>>> from itertools import tee
>>> first = iter(s)
>>> second, first = tee(first)
>>> next(second)
'a'
>>> [i+j for i,j in zip(first,second) if i==j]
['bb', 'bb', 'pp']
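The same idea can be wrapped in a generator so nothing is built until the caller asks for it. This is only a sketch for Python 3; adjacent_pairs is a hypothetical helper name, not something from the standard library.

```python
from itertools import tee


def adjacent_pairs(iterable):
    """Yield a doubled item for every overlapping pair of equal neighbours."""
    first, second = tee(iterable)
    # Advance the second iterator by one item; the default avoids
    # StopIteration on empty input.
    next(second, None)
    for a, b in zip(first, second):
        if a == b:
            yield a + b


print(list(adjacent_pairs('abbbcppq')))          # ['bb', 'bb', 'pp']
print(list(adjacent_pairs('ugknbfddgicrmopn')))  # ['dd']
```

Like the zip version above, this counts overlapping pairs, so a run of three b's yields 'bb' twice.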
Timings compared with the RegEx recipe:
# ZIP
~ $ python -m timeit --setup "s='abbbcppq'" "[i+j for i,j in zip(s,s[1:]) if i==j]"
1000000 loops, best of 3: 1.56 usec per loop
# REGEX
~ $ python -m timeit --setup "s='abbbcppq';import re" "[i[0] for i in re.findall(r'(([a-z])\2)', 'abbbbcppq')]"
100000 loops, best of 3: 3.21 usec per loop
After your last edit, as mentioned in the comments: if you want to match only one pair of b in strings like "abbbcppq", you can use finditer(), which returns an iterator of match objects, and extract the result with the group() method:
>>> import re
>>>
>>> s = "abbbcppq"
>>> [item.group(0) for item in re.finditer(r'([a-z])\1',s,re.I)]
['bb', 'pp']
Note that re.I is the IGNORECASE flag, which makes the RegEx match uppercase letters too.
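As a rough sketch, the finditer() approach can be put in a small function and checked against both strings from the question. The name doubled_letters is made up for this example.

```python
import re


def doubled_letters(text):
    """Return each non-overlapping doubled letter, ignoring case."""
    return [m.group(0) for m in re.finditer(r'([a-z])\1', text, re.I)]


print(doubled_letters('abbbcppq'))   # ['bb', 'pp']
print(doubled_letters('abbbbcppq'))  # ['bb', 'bb', 'pp']
print(doubled_letters('aBbc'))       # ['Bb']
```

Because matches do not overlap, three b's give one pair and four b's give two, which is the behaviour asked for in the edited question. The last call shows the effect of re.I: the backreference is compared case-insensitively as well.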
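Using a backreference, it is very easy:
import re
# A letter followed by one or more repeats of itself. A plain r'' string is
# used because the ur'' prefix is a syntax error in Python 3.
p = re.compile(r'([a-z])\1{1,}')
re.findall(p, "ugknbfddgicrmopn")
#output: ['d']
re.findall(p, "abbbcppq")
#output: ['b', 'p']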
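Note that findall() returns only the captured letter here, not the whole run. If the full run is wanted, one option (a sketch, not part of the answer above) is to read group(0) from finditer(); \1+ is used as an equivalent spelling of \1{1,}.

```python
import re

# \1+ means the same as \1{1,}: one or more repeats of the captured letter.
run = re.compile(r'([a-z])\1+')

letters = run.findall('abbbcppq')
runs = [m.group(0) for m in run.finditer('abbbcppq')]

print(letters)  # ['b', 'p']
print(runs)     # ['bbb', 'pp']
```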
For more details, you can refer to a similar question about Perl: Regular expression to match any character being repeated more than 10 times
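It is pretty easy without regular expressions, if what you want is the letters that occur exactly twice anywhere in the string:
In [3]: import collections
In [4]: [k for k, v in collections.Counter("abracadabra").items() if v==2]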
Out[4]: ['b', 'r']
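Counter counts occurrences anywhere in the string, so it does not tell you whether the two letters are next to each other. If adjacency matters, as in the question, a different regex-free technique is itertools.groupby, which groups consecutive equal characters. The following is a sketch under that assumption; doubled_runs is a hypothetical name.

```python
from itertools import groupby


def doubled_runs(text):
    """Return one doubled letter per non-overlapping adjacent pair."""
    pairs = []
    for letter, group in groupby(text):
        run_length = sum(1 for _ in group)
        # A run of n equal letters contains n // 2 non-overlapping pairs.
        pairs.extend([letter * 2] * (run_length // 2))
    return pairs


print(doubled_runs('abbbcppq'))     # ['bb', 'pp']
print(doubled_runs('abbbbcppq'))    # ['bb', 'bb', 'pp']
print(doubled_runs('abracadabra'))  # []
```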