How do I match text in HTML that's not inside tags?

Tags: html, regex, perl

Given a string like this:

<a href="http://blah.com/foo/blah">This is the foo link</a>

... and a search string like "foo", I would like to highlight all occurrences of "foo" in the text of the HTML -- but not inside a tag. In other words, I want to get this:

<a href="http://blah.com/foo/blah">This is the <b>foo</b> link</a>

However, a simple search-and-replace won't work, because it will match part of the URL in the <a> tag's href.
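To make the failure concrete, here is a minimal sketch of the naive substitution. The variable names are only illustrative.

```perl
use strict;
use warnings;

my $html = '<a href="http://blah.com/foo/blah">This is the foo link</a>';

# Naive global replace: it has no notion of where a tag starts or ends,
# so it also rewrites the "foo" inside the href attribute.
(my $naive = $html) =~ s/foo/<b>foo<\/b>/g;

print "$naive\n";
# <a href="http://blah.com/<b>foo</b>/blah">This is the <b>foo</b> link</a>
```

The href now contains markup, so the link is broken.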

So, to express the above in the form of a question: How do I restrict a regex so that it only matches text outside of HTML tags?

Note: I promise that the HTML in question will never be anything pathological like:

<img title="Haha! Here are some angle brackets to screw you up: ><" />

Edit: Yes, of course I'm aware that there are complex libraries in CPAN that can parse even the most heinous HTML, and thus alleviate the need for such a regex. On many occasions, that's what I would use. However, this is not one of those occasions, since keeping this script short and simple, without external dependencies, is important. I just want a one-line regex.

Edit 2: Again, I know that Template::Refine::Fragment can parse all my HTML for me. If I were writing an application I would certainly use a solution like that. But this isn't an application. It's barely more than a shell script. It's a piece of disposable code. Being a single, self-contained file that can be passed around is of great value in this case. "Hey, run this program" is a much simpler instruction than, "Hey, install a Perl module and then run this-- wait, what, you've never used CPAN before? Okay, run perl -MCPAN -e shell (preferably as root) and then it's going to ask you a bunch of questions, but you don't really need to answer them. No, don't be afraid, this isn't going to break anything. Look, you don't need to answer every question carefully -- just hit enter over and over. No, I promise, it's not going to break anything."

Now multiply the above across a large number of users who are wondering why the simple script they've been using isn't so simple anymore, when all that's changed is to make the search term boldface.

So while Template::Refine::Fragment may be the answer to someone else's HTML parsing question, it's not the answer to this question. I just want a regular expression that works on the very limited subset of HTML that the script will actually be asked to parse.

asked Feb 22 '09 03:02

raldi

2 Answers

If you can absolutely guarantee that there are no angle brackets in the HTML other than those used to open and close tags, this should work:

s%(>|\G)([^<]*?)($key)%$1$2<b>$3</b>%g
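A minimal sketch of how that substitution might be used in a script. This is not part of the original answer: the `highlight` helper and its argument names are hypothetical, and `\Q...\E` is added on the assumption that the search term should be matched literally rather than as a pattern. It relies on the same guarantee as above, namely that `<` and `>` appear only as tag delimiters.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: wrap each occurrence of $key that appears in text
# (not inside a tag) in <b>...</b>.
sub highlight {
    my ($html, $key) = @_;

    # (>|\G)   start right after a tag's closing '>' or where the previous
    #          match ended (the start of the string on the first pass)
    # [^<]*?   skip as little text as possible, never crossing into a tag
    # \Q..\E   treat the search term literally (assumption, see above)
    $html =~ s%(>|\G)([^<]*?)(\Q$key\E)%$1$2<b>$3</b>%g;

    return $html;
}

print highlight('<a href="http://blah.com/foo/blah">This is the foo link</a>', 'foo'), "\n";
```

Because a match can only begin after a `>` or at the end of the previous match, and `[^<]*?` cannot step over a `<`, the engine never gets a chance to match inside the angle brackets of a tag.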

104

answered Oct 03 '22 15:10

David Z

In general, you want to parse the HTML into a DOM, and then traverse the text nodes. I would use Template::Refine for this:

#!/usr/bin/env perl

use strict;
use warnings;
use feature ':5.10';

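use Template::Refine::Fragment;
# simple_replace is not exported by Fragment; this assumes it comes from
# Template::Refine::Utils, as in the module's synopsis.
use Template::Refine::Utils qw(simple_replace);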

my $frag = Template::Refine::Fragment->new_from_string('<p>Hello, world.  <a href="http://foo.com/">This is a test of foo finding.</a>  Here is another foo.');

say $frag->process(
    simple_replace {
        my $n = shift;
        my $text = $n->textContent;
        $text =~ s/foo/<foo>/g;
        return XML::LibXML::Text->new($text);
    } '//text()',
)->render;

This outputs:

<p>Hello, world.  <a href="http://foo.com/">This is a test of &lt;foo&gt; finding.</a>  Here is another &lt;foo&gt;.</p>

Anyway, don't parse structured data with regular expressions. HTML is not "regular", it's "context-free".

Edit: finally, if you are generating the HTML inside your program, and you have to do transformations like this on strings, "UR DOIN IT WRONG". You should build a DOM, and only serialize it when everything has been transformed. (You can still use TR, however, via the new_from_dom constructor.)

answered Oct 03 '22 15:10

jrockway
