How to use the pipe operator as part of a regular expression?

Tags: python, regex

I want to match the URL within strings like these:

u1 = "Check this out http://www.cnn.com/stuff lol"
u2 = "see http://www.cnn.com/stuff2"
u3 = "http://www.espn.com/stuff3 is interesting"

Something like the following works, but it's cumbersome because I have to repeat the whole pattern for each site:

re.findall("[^ ]*.cnn.[^ ]*|[^ ]*.espn.[^ ]*", u1)

In particular, my real code needs to match a much larger number of websites. Ideally I could do something similar to:

re.findall("[^ ]*.cnn|espn.[^ ]*", u1)

but of course this doesn't work, because I am not specifying the website names correctly. How can this be done better? Thanks.


asked Apr 24 '11 21:04

ceiling cat

1 Answer

Non-capturing groups allow you to group characters without having that group also be returned as a match.

cnn|espn becomes (?:cnn|espn):

re.findall(r"[^ ]*\.(?:cnn|espn)\.[^ ]*", u1)
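To see why the group matters, here is a minimal sketch comparing the ungrouped pattern, a capturing group, and a non-capturing group. It also builds the alternation from a list so the surrounding pattern is written only once. The `sites` list (including "bbc") and the variable names are assumptions made for illustration, not part of the question.

```python
import re

u1 = "Check this out http://www.cnn.com/stuff lol"
u3 = "http://www.espn.com/stuff3 is interesting"

# Without a group, | splits the WHOLE pattern in two:
# "[^ ]*\.cnn"  OR  "espn\.[^ ]*", so the match stops right after "cnn".
ungrouped = re.findall(r"[^ ]*\.cnn|espn\.[^ ]*", u1)

# With a capturing group, findall returns only the text of the group.
capturing = re.findall(r"[^ ]*\.(cnn|espn)\.[^ ]*", u1)

# With a non-capturing group, findall returns the whole match.
non_capturing = re.findall(r"[^ ]*\.(?:cnn|espn)\.[^ ]*", u1)

# Hypothetical list of site names; re.escape protects any regex
# metacharacters that a name might contain.
sites = ["cnn", "espn", "bbc"]
alternation = "|".join(re.escape(site) for site in sites)
pattern = re.compile(r"[^ ]*\.(?:" + alternation + r")\.[^ ]*")

print(ungrouped)            # ['http://www.cnn']
print(capturing)            # ['cnn']
print(non_capturing)        # ['http://www.cnn.com/stuff']
print(pattern.findall(u3))  # ['http://www.espn.com/stuff3']
```

Compiling the pattern once is a reasonable choice if the same list of sites is applied to many strings.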

Also note that . is a regex special character (by default it matches any character except a newline). To match a literal . character, you must escape it with a backslash: \.
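A small sketch of what the unescaped dot lets through; the test string is made up for illustration and deliberately has no literal dots around "cnn":

```python
import re

# Hypothetical input: "x" and "y" sit where the dots would normally be.
text = "visit http://wwwxcnnycom/stuff now"

# Unescaped: each "." happily matches "x" and "y".
unescaped = re.findall(r"[^ ]*.(?:cnn|espn).[^ ]*", text)

# Escaped: a literal "." is required on both sides, so nothing matches.
escaped = re.findall(r"[^ ]*\.(?:cnn|espn)\.[^ ]*", text)

print(unescaped)  # ['http://wwwxcnnycom/stuff']
print(escaped)    # []
```

Writing the pattern as a raw string (the r"..." prefix) keeps Python from treating \. as a string escape sequence before the regex engine sees it.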

answered Sep 22 '22 01:09

Ignacio Vazquez-Abrams
