regexp: match all but every <(pre|code|textarea)>(.*?)</\1> in an html document

Question

It's a challenge!

As the title says, I would like to match everything but the content of the tags <pre>, <code> and <textarea> in an HTML document (for example you can try on the following text).

My goal is to compress the HTML by removing extra whitespace and doing other cleanup, except where whitespace is strictly required, as in a textarea.

As I work in PHP, I also thought about extracting the content of those tags, processing the rest in PHP, and reinjecting the extracted content afterwards. But I'm very curious about a way to do that in regexp!

I tried the expression ((?=.?)((?!<pre>).)) with the flags 'msg' on the great online editor http://regex101.com/, but it is not exactly what I want.

Any help would be much appreciated!

Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna <span>aliquam</span> erat volutpat. Ut wisi enim ad minim veniam, quis nostrud exerci tation ullamcorper suscipit lobortis nisl ut aliquip ex ea commodo consequat.

<pre>Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. Nam liber tempor cum soluta nobis eleifend option congue nihil imperdiet doming id quod mazim placerat facer possim assum.
Typi non habent claritatem insitam; est usus legentis in iis qui facit eorum claritatem.</pre>

Investigationes demonstraverunt lectores legere me lius quod ii legunt saepius.
Claritas est etiam processus dynamicus, qui sequitur mutationem consuetudium lectorum.
<pre>Mirum est notare quam littera gothica, quam nunc putamus parum claram, anteposuerit litterarum formas humanitatis per seacula quarta decima et quinta decima.</pre>
Eodem modo typi, qui nunc nobis videntur parum clari, fiant sollemnes in futurum.

Casimir et Hippolyte · Accepted Answer

You can use this:

$pattern = <<<'LOD'
~
# definitions : 
(?(DEFINE) (?<tagBL> pre | code | textarea | style | script )
     (?<tagContent> < (\g<tagBL>) \b .*? </ \g{-1} > )
     (?<tags> < [^>]* > )
     (?<cdata> <!\[CDATA .*? ]]> )

     (?<exclusionList> \g<tagContent> | \g<cdata> | \g<tags>)
)

# pattern :
\g<exclusionList> (*SKIP) (*FAIL) | \s+
~xsi
LOD;

$html = preg_replace($pattern, ' ', $html);

Note that this is a general approach: you can easily adapt it to a specific case by adding or removing items in the exclusion list. If you need other types of replacement, you can adapt it too by using capturing groups and preg_replace_callback().
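As a rough sketch of the preg_replace_callback() variant (a simplified illustration, not the pattern from this answer: the tag list is reduced, whitespace inside ordinary tags is not protected, and the sample $html is made up):

```php
<?php
// Hypothetical sample input; only <pre>, <code> and <textarea> are protected here.
$html = "<p>a   b</p>\n\n<pre>x   y\n z</pre>   <p>c</p>";

// Group 1 captures a whole protected block; the second branch matches whitespace runs.
$pattern = '~(<(pre|code|textarea)\b.*?</\2>)|\s+~si';

$result = preg_replace_callback($pattern, function (array $m): string {
    // If the protected-block branch matched, return the block unchanged;
    // otherwise the match is a whitespace run, collapsed to one space.
    return isset($m[1]) && $m[1] !== '' ? $m[1] : ' ';
}, $html);

echo $result;
```

The callback decides per match what to return, so the same structure can be reused for replacements other than a single space.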

Another note: an HTML tag stays open until its closing tag. If the closing tag doesn't exist, all the content after the tag belongs to it until the end of the string. To deal with that, you can for example change </ \g{-1} > to (?: </ (?:\g{-1}| head | body | html) > | $) in the tagContent definition, or compose more advanced rules.
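For illustration, here is the pattern with that change applied to the tagContent definition, run on a made-up input whose <pre> is never closed (the sample string is an assumption, the rest follows the pattern above):

```php
<?php
$pattern = <<<'LOD'
~
(?(DEFINE) (?<tagBL> pre | code | textarea | style | script )
     # a protected element ends at its own closing tag, at a closing
     # head, body or html tag, or at the end of the string
     (?<tagContent> < (\g<tagBL>) \b .*? (?: </ (?:\g{-1} | head | body | html) > | $ ) )
     (?<tags> < [^>]* > )
     (?<cdata> <!\[CDATA .*? ]]> )

     (?<exclusionList> \g<tagContent> | \g<cdata> | \g<tags>)
)

\g<exclusionList> (*SKIP) (*FAIL) | \s+
~xsi
LOD;

// Hypothetical input: the <pre> element has no closing tag.
$html = "a   b <pre>x   y";

$result = preg_replace($pattern, ' ', $html);
echo $result;
```

With the extra `| $` alternative, the whitespace after the unclosed <pre> is left alone, while the whitespace before it is still collapsed.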

EDIT:

Some information you can find in the PHP manual:

The nowdoc syntax is an alternative syntax to define strings.
It is very useful for keeping a multiline string readable: the layout is preserved and you don't have to wonder whether quotes need escaping.
The nowdoc syntax behaves like single quotes, i.e. variables are not interpolated and escape sequences like \t or \n are not interpreted. If you want the behaviour of double quotes, use the heredoc syntax.
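A small sketch of the difference between the two syntaxes (the variable names and the TXT label are arbitrary choices for this example):

```php
<?php
$name = 'world';

// Nowdoc (quoted label): like single quotes, nothing is interpreted.
$nowdoc = <<<'TXT'
Hello $name\n
TXT;

// Heredoc (bare label): like double quotes, $name is interpolated
// and \n becomes a real newline character.
$heredoc = <<<TXT
Hello $name\n
TXT;

var_dump($nowdoc === 'Hello $name\n');   // bool(true)
var_dump($heredoc === "Hello world\n");  // bool(true)
```

This is why nowdoc suits regex patterns: backslashes and dollar signs reach the regex engine exactly as written.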

Some information you can find in http://pcre.org/pcre.txt:

First of all: the pattern delimiter

Most of the time, people write their patterns with the / delimiter: /Gnagnagna/, /blablabla/ixUums, etc.
But when they write a pattern containing a thousand or a million slash characters, they prefer escaping each slash, one by one, rather than choosing another delimiter! With PHP, you can choose any pattern delimiter you want, as long as it is not an alphanumeric character, a backslash or whitespace. I have chosen ~ instead of / for three reasons:

1. If I choose ~, I don't have to escape slashes, because there is no ambiguity between the delimiter and a literal character.
2. In eight months on this site, I have never seen anybody ask for a pattern with a tilde inside.
3. If one day someone does ask for a pattern with a tilde, I will be sure I have had an encounter of the third kind.
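To make the first reason concrete, here is the same check written with both delimiters (the URL is a made-up example):

```php
<?php
$url = 'http://example.com/a/b';

// With / as the delimiter, every literal slash must be escaped.
$withSlash = '/^http:\/\/[^\/]+\/a\/b$/';

// With ~ as the delimiter, slashes are ordinary characters.
$withTilde = '~^http://[^/]+/a/b$~';

var_dump(preg_match($withSlash, $url)); // int(1)
var_dump(preg_match($withTilde, $url)); // int(1)
```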

Second: How to make a long pattern more readable?

PCRE (Perl Compatible Regular Expressions, the regex engine used by PHP) offers ways to make a pattern more readable. They are the same ones you find in ordinary code:

1. You can ignore white spaces
2. You can add comments
3. You can define subpatterns

For 1 and 2, it's easy: you only need to add the x modifier (that is why there is an x at the end of the pattern). The x modifier enables verbose mode, where whitespace in the pattern is ignored and where you can add comments like this # comment at the end of a line.
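As a quick illustration, the two patterns below are meant to be equivalent; the date format is just an example chosen for this sketch:

```php
<?php
$compact = '~^\d{4}-\d{2}$~';

// Same pattern in verbose mode: whitespace is ignored, # starts a comment.
$verbose = '~
    ^ \d{4}   # four-digit year
    -         # literal hyphen
    \d{2} $   # two-digit month
~x';

$date = '2024-05';
var_dump(preg_match($compact, $date)); // int(1)
var_dump(preg_match($verbose, $date)); // int(1)
```

One consequence of verbose mode: a literal space in the pattern has to be written as `\ `, `[ ]` or `\s`, since a plain space is ignored.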

About subpatterns: you can use named groups. For example, instead of writing ~([0-9]+)~ to match and capture a number in group 1, you can write ~(?<number>[0-9]+)~. With this named subpattern, you can refer to the captured content with \g{number}, or to the pattern itself with \g<number>, anywhere in the pattern. Examples:

~^(?<num>[0-9]+)(?<letter>[a-z]+)\g<num>\g<letter>$~

will match 45ab67cd

~^(?<num>[0-9]+)(?<letter>[a-z]+)\g{num}\g<letter>$~

will match 45ab45cd but not 45ab67cd

In these two examples, the named subpatterns are part of the main pattern and match the start of the string. But with the (?(DEFINE)...) syntax you can define them outside the main pattern, because nothing written between these parentheses is matched.

~(?(DEFINE)(?<num>[0-9]+)(?<letter>[a-z]+))^\g<num>\g<letter>$~

doesn't match 45ab67cd, because everything inside the DEFINE part is ignored for the match, but:

~(?(DEFINE)(?<num>[0-9]+)(?<letter>[a-z]+))^\g<num>\g<letter>\g<num>\g<letter>$~

does.
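The four patterns above can be checked with a short script (the variable names are chosen for this sketch only):

```php
<?php
// \g<name> re-runs the subpattern; \g{name} requires the same captured text.
$subroutine = '~^(?<num>[0-9]+)(?<letter>[a-z]+)\g<num>\g<letter>$~';
$backref    = '~^(?<num>[0-9]+)(?<letter>[a-z]+)\g{num}\g<letter>$~';

// Definitions inside (?(DEFINE)...) match nothing by themselves.
$defineShort = '~(?(DEFINE)(?<num>[0-9]+)(?<letter>[a-z]+))^\g<num>\g<letter>$~';
$defineLong  = '~(?(DEFINE)(?<num>[0-9]+)(?<letter>[a-z]+))^\g<num>\g<letter>\g<num>\g<letter>$~';

var_dump(preg_match($subroutine, '45ab67cd'));  // int(1)
var_dump(preg_match($backref, '45ab45cd'));     // int(1)
var_dump(preg_match($backref, '45ab67cd'));     // int(0)
var_dump(preg_match($defineShort, '45ab67cd')); // int(0)
var_dump(preg_match($defineLong, '45ab67cd'));  // int(1)
```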

Third: relative backreferences

When you use a capturing group in a pattern, you can use a reference to the captured content, example:

$str = 'cats meow because cats are bad.';

$pattern = '~^(\w+) \w+ \w+ \1 \w+ \w+\.$~';

var_dump(preg_match($pattern, $str));

This code prints int(1), since the pattern matches the string. In the pattern, \1 refers to the content (cats) of the first capturing group. Instead of \1, you can use the Oniguruma syntax and write \g{1}, which refers to the first capturing group too; it is the same.

Now, if you want to refer to the content of the last capturing group, but you don't care about the number (or the name) of the group, you can use a relative reference by writing \g{-1} (i.e. the first group to the left of the reference).
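The three notations can be compared on the same string; in this sketch only the reference syntax changes between the patterns:

```php
<?php
$str = 'cats meow because cats are bad.';

$numbered = '~^(\w+) \w+ \w+ \1 \w+ \w+\.$~';      // classic numbered reference
$braced   = '~^(\w+) \w+ \w+ \g{1} \w+ \w+\.$~';   // same group, \g{} syntax
$relative = '~^(\w+) \w+ \w+ \g{-1} \w+ \w+\.$~';  // nearest group to the left

var_dump(preg_match($numbered, $str)); // int(1)
var_dump(preg_match($braced, $str));   // int(1)
var_dump(preg_match($relative, $str)); // int(1)
```

The relative form keeps working if more capturing groups are later added in front of it, which is what makes it convenient inside reusable definitions.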

Fourth: the modifiers xsi

The general behaviour of a pattern can be changed by modifiers. Here I used three modifiers:

x # for verbose mode
i # make the pattern case insensitive (i.e. '~CaT~i' will match "cat")
s # (singleline mode): by default the . doesn't match newline, with the s modifier it does.
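A short check of the i and s modifiers (the test strings are made up for this sketch):

```php
<?php
// i: case-insensitive matching.
var_dump(preg_match('~CaT~', 'cat'));  // int(0)
var_dump(preg_match('~CaT~i', 'cat')); // int(1)

// s: the dot also matches a newline.
$twoLines = "a\nb";
var_dump(preg_match('~a.b~', $twoLines));  // int(0)
var_dump(preg_match('~a.b~s', $twoLines)); // int(1)
```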

The last: Backtracking control verbs

Backtracking control verbs are an experimental feature inherited from the Perl regex engine (they are marked experimental in Perl too, but if nobody uses them, that will not change).

What is the backtracking?

If I try to match "aaaaab" with ~a+ab~, the regex engine, since + is a greedy quantifier, first consumes all the a's (five of them). After that only a b remains, which does not match the subpattern ab. The only way forward for the regex engine is to give back one a; then ab can match. This is the default behaviour of the regex engine.
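The effect can be observed by capturing what a+ ends up with, and by comparing with the possessive quantifier a++ (a PCRE feature not otherwise used in this answer, shown here only because it forbids giving characters back):

```php
<?php
// Greedy a+ first takes five a's, then gives one back so that "ab" can match.
preg_match('~(a+)ab~', 'aaaaab', $m);
var_dump($m[1]); // string(4) "aaaa"

// Possessive a++ never gives anything back, so "ab" can no longer match.
$possessive = preg_match('~a++ab~', 'aaaaab');
var_dump($possessive); // int(0)
```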

More about backtracking here and here.

Backtracking control verbs are tools that force the regex engine to behave the way you want for a subpattern.

Here I used two verbs: (*SKIP) and (*FAIL)

(*FAIL) is the easiest: the subpattern is forced to fail immediately.

(*SKIP): when the pattern fails after this verb, the regex engine is not allowed to backtrack into the characters matched before the verb, and that content can't be reused by another alternative. The next match attempt starts at the position where (*SKIP) was reached.
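A reduced sketch of the same skip-and-fail idea, on a problem made up for this example: replace every number, except those inside double quotes.

```php
<?php
$text = 'a 12 "b 34" 56';

// First branch: consume a quoted string, then discard it with (*SKIP)(*FAIL)
// so that nothing inside it can be matched by the second branch.
// Second branch: the thing we actually want to replace.
$pattern = '~"[^"]*"(*SKIP)(*FAIL)|\d+~';

$result = preg_replace($pattern, '#', $text);
echo $result; // a # "b 34" #
```

The pattern in this answer has the same shape: the exclusion list plays the role of the quoted string, and \s+ plays the role of \d+.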

I understand that all these things are not always easy, but I hope that, step by step, one day, all of these things will be clear for you.

alfonsodev · Answer

If you want to parse HTML, I would suggest using PHP's DOMXPath or similar, as it is meant and specialised for that task. You'll find Chrome extensions to test your queries.
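A minimal sketch of what that could look like for the whitespace problem in the question, assuming the DOM extension is available; the XPath query and the sample markup are assumptions made for this illustration, and saveHTML() wraps the fragment in a full document:

```php
<?php
// Hypothetical sample input.
$html = '<p>a   b</p><pre>x   y</pre>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Select every text node that is not inside <pre>, <code> or <textarea>.
$query = '//text()[not(ancestor::pre or ancestor::code or ancestor::textarea)]';

foreach ($xpath->query($query) as $node) {
    // Collapse whitespace runs in the selected text nodes only.
    $node->nodeValue = preg_replace('~\s+~', ' ', $node->nodeValue);
}

$out = $doc->saveHTML();
echo $out;
```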

Also read this answer, it's funny: You can't parse [X]HTML with regex. The answer explaining why HTML can't be parsed by regex was voted up more than 4400 times.

Edit: that said, maybe you need to parse only fragments or invalid HTML; in that case I would go with a "simple" regex approach like the one Steve P answered above.

Tags: html, regex, php

antoni
