I want to match the URL within strings like:
u1 = "Check this out http://www.cnn.com/stuff lol"
u2 = "see http://www.cnn.com/stuff2"
u3 = "http://www.espn.com/stuff3 is interesting"
Something like the following works, but it's cumbersome because I have to repeat the whole pattern for each site:
re.findall("[^ ]*.cnn.[^ ]*|[^ ]*.espn.[^ ]*", u1)
In particular, my real code needs to match a much larger number of websites. Ideally I could write something like:
re.findall("[^ ]*.cnn|espn.[^ ]*", u1)
but of course that doesn't work, because I am not specifying the website names correctly. How can this be done better? Thanks.
The pipe (|) indicates that a match can be either of the two terms on either side of it. The caret (^), used at the beginning of an expression, denotes where a match should begin.
A pipe symbol allows regular expression components to be logically ORed. For example, the regular expression ^(Germany|Netherlands) matches lines that start with the word "Germany" or the word "Netherlands". Note that parentheses are used to group the two alternatives.
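To illustrate why the parentheses matter, here is a minimal Python sketch. The sample lines and variable names are made up for illustration. Because | has the lowest precedence in a regex, leaving out the group changes what the anchor applies to, which is the same reason the ungrouped cnn|espn attempt in the question misbehaves.

```python
import re

# Hypothetical sample lines, chosen only to show the difference.
lines = [
    "Germany is large",
    "Netherlands is flat",
    "I visited the Netherlands",
]

# Grouped: the ^ anchor applies to both alternatives.
grouped = re.compile(r"^(?:Germany|Netherlands)")

# Ungrouped: parsed as (^Germany) OR (Netherlands anywhere in the line).
ungrouped = re.compile(r"^Germany|Netherlands")

grouped_hits = [s for s in lines if grouped.search(s)]
ungrouped_hits = [s for s in lines if ungrouped.search(s)]

print(grouped_hits)    # only the lines that start with one of the words
print(ungrouped_hits)  # also includes the line that merely contains "Netherlands"
```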
The [] construct in a regex is essentially shorthand for an | over each of its contents. For example, [abc] matches a, b, or c. Additionally, the - character has special meaning inside []: it provides a range construct. The regex [a-z] will match any lowercase letter a through z.
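A short Python sketch of the character-class behaviour described above. The input strings are arbitrary examples, not taken from the question.

```python
import re

text = "a1b2c3d4"

# [abc] matches any single one of the listed characters...
from_class = re.findall(r"[abc]", text)

# ...which gives the same result as spelling out the alternation.
from_pipe = re.findall(r"a|b|c", text)

# A range inside [] covers every character between its endpoints.
lowercase_runs = re.findall(r"[a-z]+", "Hello World 42")

print(from_class)      # ['a', 'b', 'c']
print(from_pipe)       # ['a', 'b', 'c']
print(lowercase_runs)  # ['ello', 'orld']
```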
Non-capturing groups allow you to group characters without having that group also be returned as a match. Here, cnn|espn becomes (?:cnn|espn):
re.findall(r"[^ ]*\.(?:cnn|espn)\.[^ ]*", u1)
Also note that . is a regex special character (it will match any character except newline). To match the . character itself, you must escape it with \.
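Since the real code needs to cover many more sites, one possible approach is to build the alternation from a list rather than typing it by hand. The sketch below is only an illustration: the sites list and the build_url_pattern helper are hypothetical names, not part of the original answer. re.escape is used so that any special characters in a site name are treated literally.

```python
import re

u1 = "Check this out http://www.cnn.com/stuff lol"
u2 = "see http://www.cnn.com/stuff2"
u3 = "http://www.espn.com/stuff3 is interesting"

# Hypothetical list of site names; extend it as needed.
sites = ["cnn", "espn"]


def build_url_pattern(site_names):
    """Compile a pattern matching a space-free token containing .<site>."""
    alternatives = "|".join(re.escape(name) for name in site_names)
    # (?:...) groups the alternatives without capturing, so findall
    # returns the whole match instead of just the site name.
    return re.compile(r"[^ ]*\.(?:" + alternatives + r")\.[^ ]*")


url_pattern = build_url_pattern(sites)
matches = [url_pattern.findall(u) for u in (u1, u2, u3)]

for found in matches:
    print(found)
```

With a capturing group, (cnn|espn), re.findall would return only the captured site names, which is why the non-capturing form is used here.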