
How to match both multi line and single line

I'm trying to get my head around some regex (using Python 2.7) and have hit a confusing snag. It's to do with (.*). I know that the dot matches everything except a newline unless you pass the re.DOTALL flag. But when I do use the tag, it includes too much. Here is the code, with a few variations I've tried and their results:

import re
from urllib2 import urlopen
webpage = urlopen('http://trev.id.au/testfiles/rgxtxt.php').read()

# find the instances of pattern in the file
findPatHTMLComment = re.findall('<!--(.*)-->',webpage) 
foundItems = len(findPatHTMLComment) # how many instances were found?
# Print results
print "Found " + str(foundItems) + " matches. They are: "
for i, comment in enumerate(findPatHTMLComment):
    print "HTML_Comment[" + str(i) + "]: |" + comment + "| END HTML Comment"

This results in finding 3 matches as it doesn't find the multi-line comment sections.

Using:

findPatHTMLComment = re.findall('<!--(.*)-->',webpage,re.DOTALL)

Finds a single match, running from the first <!-- to the last --> at the end of the document.

findPatHTMLComment = re.findall('<!--(.*)-->',webpage,re.MULTILINE)

Finds the same as the first one, only 3 out of the 5 comments that are in the file.

QUESTION: What is it that I should use in this instance as the regex? Could you explain it for me and others too?

Appreciate any guidance you can provide. Thanks and have a nice day.

EDIT: Include sample data that was at link in code above (will be removing sample data from server soon):

<html>
<!--[if lt IE 9 ]>
    <script type="text/javascript">
        jQuery(function ($) {
            function TopSearchIE9(input,inputBtn){
                var $topSearch=$(input);
                var $topSearchBtn=$(inputBtn);
                $topSearch.keydown(function(e) {
                    if (e.keyCode == 13) {
                        $topSearchBtn.trigger("click");
                        return false;
                    }
                });
            }
            TopSearchIE9(".J-txt-focus1",".J-txt-focus1-btn");
            TopSearchIE9(".J-txt-focus2",".J-txt-focus2-btn");
        });
    </script> 
<![endif]-->
<!--[if lt IE 10 ]>
    <style>
        .new-header-search .hdSch-txt{ width: 225px;}
        .new-header-search .hdSch-del{width: 0px; padding: 5px 0px;}
        .new-header-search .hdSch-del.del{background:none; padding: }
    </style>
<![endif]-->
<body>
    <!-- This is a text file with a number of items to allow testing of some regex methods. It has no actual meaning -->
    <div head1>Item heading for first item</div>
    <!--By the way, this is a comment in a block of HTML text.-->
    <div itembody>We can jump over the moon if we are fast enough, but we really shouldn't try it cause we may get a blood nose. When we do try and succeed it feels quite good.</div>
    <div head1>Item heading for second item</div>
    <div itembody>If this is showing, its the second body within the itembody div tags for this file</div>
    <div head1>Item heading for the third item</div>
    <div itembody>
        Going to add another div tag 
        <div highlight>
            and closing div tag
        </div> 
        in this body to see how it handles that.
    </div>
    <!-- The above itembody data should 
        have it's own div and closing div tags -->
    <div head1>Item heading for the fourth item</div>
    <div itembody>
        <p><a href="mailto:[email protected]">email fred</a> or phone him on +63 493 3382 3329 when you are ready to try more regex stuff.</p>
        <p>You can also check with Barney by <a href="mailto:[email protected]">emailing him</a> or phone him of +44 394 394 3992 if that is easier</p>
    </div>
    <!-- Thats all folks... -->
</body>

Trevor asked Nov 27 '25 20:11


1 Answer

But when I do use the tag, it includes too much.

* is a greedy operator, meaning it matches as much as it can while still allowing the remainder of the regular expression to match. With re.DOTALL, .* therefore runs from the first <!-- all the way to the last --> in the document. Follow the * with ? for a non-greedy match, which means "zero or more, preferably as few as possible".

re.findall('<!--(.*?)-->', webpage, re.DOTALL)
                   ↑
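
To see the difference without fetching the page, here is a minimal sketch on a made-up string. The `sample` text and the variable names are assumptions for illustration, not the asker's data; `print()` is called with a single argument so the snippet should run on both Python 2.7 and Python 3.

```python
import re

# Hypothetical sample: two single-line comments and one multi-line comment.
sample = (
    "<!-- first -->\n"
    "<div>text</div>\n"
    "<!-- second\n"
    "     spans two lines -->\n"
    "<!-- third -->\n"
)

# Greedy with DOTALL: one match running from the first <!-- to the last -->.
greedy = re.findall('<!--(.*)-->', sample, re.DOTALL)

# Non-greedy with DOTALL: each comment is matched on its own.
lazy = re.findall('<!--(.*?)-->', sample, re.DOTALL)

# No flag: '.' cannot cross a newline, so the multi-line comment is missed.
no_flag = re.findall('<!--(.*)-->', sample)

print(len(greedy))   # 1
print(len(lazy))     # 3
print(len(no_flag))  # 2
```

The captured text keeps its surrounding whitespace and newlines, so calling `.strip()` on each item may be useful before printing.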

The re.MULTILINE flag only changes the anchors ^ and $, so that they match at the start and end of each line instead of only at the start and end of the whole string. This pattern uses neither anchor, so re.MULTILINE has no effect here.
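
A small sketch of what re.MULTILINE does and does not change, using a made-up three-line string (`text` and the result names are assumptions for illustration):

```python
import re

# Hypothetical three-line input with one HTML comment in the middle.
text = "alpha\n<!-- note -->\nbeta"

# Without the flag, ^ matches only at the very start of the string.
plain = re.findall(r'^\w+', text)

# With re.MULTILINE, ^ also matches right after each newline.
multi = re.findall(r'^\w+', text, re.MULTILINE)

# The comment pattern has no ^ or $, so the flag changes nothing.
comments_plain = re.findall('<!--(.*?)-->', text)
comments_multi = re.findall('<!--(.*?)-->', text, re.MULTILINE)

print(plain)                             # ['alpha']
print(multi)                             # ['alpha', 'beta']
print(comments_plain == comments_multi)  # True
```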

On another note, I would consider using BeautifulSoup for this task.

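from bs4 import BeautifulSoup, Comment

# Parse the page fetched in the question; naming the parser avoids bs4's "no parser specified" warning.
soup = BeautifulSoup(webpage, 'html.parser')
# Comment is a NavigableString subclass, so filter the text nodes by type.
comments = soup.find_all(text=lambda text: isinstance(text, Comment))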

hwnd answered Nov 29 '25 10:11


