How can I remove an entire HTML tag (and its contents) by its class using a regex?

I am not very good with regexes, but I am learning.

I would like to remove an HTML tag by its class name. This is what I have so far:

<div class="footer".*?>(.*?)</div>

The first .*? is there because the tag might contain other attributes, and the second because the div might contain other HTML.

What am I doing wrong? I have tried many variations without success.

Update

The contents of the div can span multiple lines, and I am using Perl regexes.

As other people said, HTML is notoriously tricky to deal with using regexes, and a DOM approach might be better. E.g.:

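use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Parse the document into a tree that can be queried with XPath.
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file( 'yourdocument.html' );

# Matches elements whose class attribute is exactly "footer".
for my $node ( $tree->findnodes( '//*[@class="footer"]' ) ) {
    $node->replace_with_content;   # delete element, but not the children
    # use $node->delete instead to remove the children as well
}

print $tree->as_HTML;
$tree->delete;   # free the tree once it is no longer needed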
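Since the question asks to drop the contents too, here is a minimal sketch of that variant. It assumes the markup is already in a string (the sample $html is made up for illustration) and that "footer" may be one of several classes on the element, which is why the XPath compares against the space-padded class list rather than the exact attribute value.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample input: the footer has two classes and a nested div.
my $html = '<div id="main"><p>Body</p>'
         . '<div class="page footer"><div>Hi!</div></div></div>';

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;    # tell the parser that the input is complete

# Match "footer" as one token of a space-separated class list.
my $xpath = q{//div[contains(concat(' ', normalize-space(@class), ' '), ' footer ')]};

# delete() detaches the element from the tree together with all its children.
$_->delete for $tree->findnodes($xpath);

my $out = $tree->as_HTML;
$tree->delete;    # free the tree
print "$out\n";
```

The parser adds the html, head and body wrappers on output, so the printed markup is a full document rather than the bare fragment.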

You will also want to allow for other attributes before class in the div tag:

<div[^>]*class="footer"[^>]*>(.*?)</div>

Also, make the match case-insensitive, and since the contents can span several lines, let . match newlines (the /s modifier in Perl). Depending on the context, you may need to escape the quotes or the slash in the closing tag. What context are you doing this in?
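As a rough sketch, this is how the pattern above could be applied in Perl. The sample $html is made up for illustration, and it only works as long as the footer contains no nested div.

```perl
use strict;
use warnings;

# Hypothetical sample input; the footer spans several lines.
my $html = <<'HTML';
<div id="main">Body</div>
<DIV id="f" class="footer" style="x">
  <p>Copyright</p>
</DIV>
HTML

# s{...}{} delimiters avoid having to escape the slash in </div>.
#   /s  lets "." match newlines, so multi-line contents are covered
#   /i  matches DIV as well as div
#   /g  removes every matching block, not just the first
(my $clean = $html) =~ s{<div[^>]*class="footer"[^>]*>.*?</div>}{}sig;

print $clean;
```

The capture group from the original pattern is dropped here because the contents are thrown away anyway.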

Also note that HTML parsing with regular expressions can be very nasty, depending on the input. A good point is brought up in an answer below - suppose you have a structure like:

<div>
    <div class="footer">
        <div>Hi!</div>
    </div>
</div>

Trying to build a regex for that is a recipe for disaster. Your best bet is to load the document into a DOM, and perform manipulations on that.
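To see the failure concretely, here is a small sketch that runs the non-greedy pattern against the nested structure above (collapsed onto one line; the variable names are just for illustration). The .*? stops at the first </div> it finds, which belongs to the inner div, not to the footer.

```perl
use strict;
use warnings;

# The nested example from above, written on a single line.
my $html = '<div><div class="footer"><div>Hi!</div></div></div>';

# The non-greedy match ends at the inner </div>, so the footer's own
# closing tag is left behind in the output.
(my $broken = $html) =~ s{<div[^>]*class="footer"[^>]*>.*?</div>}{}si;

print "$broken\n";    # prints: <div></div></div>
```

Making the match greedy does not help either: it would then run to the last </div> in the document and swallow the outer div's closing tag.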

Pseudocode that should map closely to XML::DOM:

document = //load document
divs = document.getElementsByTagName("div");
for(div in divs) {
    if(div.getAttribute("class") == "footer") {
        parent = div.getParent();
        // iterate over a copy: moving nodes changes the live child list
        for(child in copy(div.getChildren())) {
            // filter attribute types?
            parent.insertBefore(child, div);   // new node first, reference node second
        }
        parent.removeChild(div);
    }
}

Perl libraries for this include HTML::DOM and XML::DOM. .NET has built-in libraries for DOM parsing.

Tags: html, regex, filter, perl

Patrick Desjardins

2 Answers

Yanick

Chris Marasti-Georg
