Extract string with Python re.match

    import re
    str = "x8f8dL:s://www.qqq.zzz/iziv8ds8f8.dafidsao.dsfsi"
    str2 = re.match("[a-zA-Z]*//([a-zA-Z]*)", str)
    print(str2.group())

Current result: error
Expected result: wwwqqqzzz

I want to extract the string wwwqqqzzz. How do I do that?

Maybe there are a lot of dots, such as:

"whatever..s#[email protected].:af//wwww.xxx.yn.zsdfsd.asfds.f.ds.fsd.whatever/123.dfiid" 

In this case, I basically want the stuff bounded by // and /. How do I achieve that?

One additional question:

    import re
    str = "xxx.yyy.xxx:80"
    m = re.search(r"([^:]*)", str)
    str2 = m.group(0)
    print(str2)
    str2 = m.group(1)
    print(str2)

It seems that m.group(0) and m.group(1) are the same.

runcode asked Nov 16 '12 20:11


1 Answer

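match only matches at the beginning of the string, so it fails here: the text before // contains digits and colons that [a-zA-Z]* cannot match. Use search instead. The following pattern would then match your requirements: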

    m = re.search(r"//([^/]*)", str)
    print(m.group(1))

Basically, we look for //, then consume as many non-slash characters as possible. Those non-slash characters are captured in group number 1.
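As a rough sketch of the difference between match and search, and between group(0) and group(1), the snippet below runs both on the first string from the question. It assumes Python 3; the names url and no_match are illustrative only (the question calls the string str).

```python
import re

# First sample string from the question; `url` is an illustrative name.
url = "x8f8dL:s://www.qqq.zzz/iziv8ds8f8.dafidsao.dsfsi"

# re.match only tries position 0: [a-zA-Z]* can consume "x" at most,
# and what follows is "8", not "//", so there is no match.
no_match = re.match(r"[a-zA-Z]*//([a-zA-Z]*)", url)
print(no_match)    # None

# re.search scans forward until the pattern fits somewhere.
m = re.search(r"//([^/]*)", url)
print(m.group(0))  # //www.qqq.zzz  (whole match, slashes included)
print(m.group(1))  # www.qqq.zzz    (only the parenthesized part)
```

group(0) is always the entire match, while group(1) is the first parenthesized group. The two coincide only when the parentheses wrap the whole pattern.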

In fact, there is a slightly more advanced technique that does the same without a capturing group (capturing adds a small amount of overhead). It uses a so-called lookbehind:

    m = re.search(r"(?<=//)[^/]*", str)
    print(m.group())

Lookarounds are not included in the actual match, hence the desired result.
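A small sketch of the lookbehind behaviour, again assuming Python 3 and using the illustrative names url and variable_width_ok. It also shows one restriction worth knowing: Python's re module only accepts fixed-width lookbehinds, so (?<=//) is fine but a variable-width one such as (?<=/+) is rejected at compile time.

```python
import re

# First sample string from the question; `url` is an illustrative name.
url = "x8f8dL:s://www.qqq.zzz/iziv8ds8f8.dafidsao.dsfsi"

# (?<=//) asserts that "//" sits just before the current position
# without making it part of the match.
m = re.search(r"(?<=//)[^/]*", url)
print(m.group())   # www.qqq.zzz
print(m.groups())  # ()  (no capturing groups were used)

# A lookbehind of variable width does not compile in Python's re module.
try:
    re.compile(r"(?<=/+)[^/]*")
    variable_width_ok = True
except re.error:
    variable_width_ok = False
print(variable_width_ok)  # False
```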

This (or any other reasonable regex solution) will not remove the dots in the same step. But that can easily be done in a second step:

    m = re.search(r"(?<=//)[^/]*", str)
    host = m.group()
    cleanedHost = host.replace(".", "")

That does not even require regular expressions.

Of course, if you want to remove everything except for letters and digits (e.g. to turn www.regular-expressions.info into wwwregularexpressionsinfo) then you are better off using the regex version of replace:

    cleanedHost = re.sub(r"[^a-zA-Z0-9]+", "", host)
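Putting the two steps together, one possible sketch (assuming Python 3) looks like the following. The function name extract_host and its letters_digits_only parameter are hypothetical, introduced only for illustration. The one addition beyond the snippets above is a check for None, since search returns None when the input contains no //.

```python
import re

def extract_host(text, letters_digits_only=False):
    """Return the text between the first '//' and the next '/', cleaned up.

    Hypothetical helper: returns None when the input contains no '//'.
    """
    m = re.search(r"//([^/]*)", text)
    if m is None:
        return None
    host = m.group(1)
    if letters_digits_only:
        # Drop everything except ASCII letters and digits.
        return re.sub(r"[^a-zA-Z0-9]+", "", host)
    # Drop only the dots; no regex needed for this.
    return host.replace(".", "")

print(extract_host("x8f8dL:s://www.qqq.zzz/iziv8ds8f8.dafidsao.dsfsi"))
# wwwqqqzzz
print(extract_host("http://www.regular-expressions.info/"))
# wwwregular-expressionsinfo
print(extract_host("http://www.regular-expressions.info/", letters_digits_only=True))
# wwwregularexpressionsinfo
print(extract_host("no slashes here"))
# None
```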

Martin Ender answered Sep 30 '22 18:09