I need a regular expression that allows anything except HTML tags. The catch is that the < and > characters themselves are allowed, just not with text between them (other characters between them are fine).
The following would be allowed:
hello world
!@$%^&*()_+'":;[]{}()\|#
<<<<<<<
>>>>>
<>
><
<087>
<-->
The following would not be allowed:
<html>
<a>
<foo>
<bar>
I've tried several expressions with no luck. This turned out to be surprisingly harder than it seemed at first (for me anyway :P)
EDIT: Basically, anything is allowed except A-Z and a-z between < and > characters.
If you are doing this to prevent HTML injection on a website, then a much better solution is to escape HTML special characters before sending them to the browser. Most web development environments and libraries have a standard function for this; for example, PHP has the htmlentities and htmlspecialchars functions.
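As a rough illustration of the escaping approach, here is a minimal sketch in Python rather than PHP. It assumes the standard library's html.escape is an acceptable stand-in for htmlspecialchars; the function name render_comment is hypothetical and only there for the example.

```python
import html


def render_comment(user_text: str) -> str:
    """Return user_text with HTML special characters escaped.

    render_comment is a hypothetical helper name, not part of any library.
    """
    # html.escape replaces &, < and >, and with quote=True (the default)
    # also the double and single quote characters.
    return html.escape(user_text)


if __name__ == "__main__":
    # Tag-like input is shown literally by the browser instead of being parsed.
    print(render_comment("<html> is just text now"))
```

With this approach nothing has to be rejected at all: input such as <html> is stored as typed and only escaped at output time.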
Shockingly, given the use case you described, it actually sounds like regexen will work here: you need to prevent <SomeTextHere> from showing up anywhere in the input, and there is no nesting or recursion to worry about. The following regex does the opposite of what you want, that is, it matches the text you want to reject: <[A-Za-z]+> (change the + to a * if you can't allow <> either). It will match everywhere such text occurs, so I'd recommend putting the negation in the language instead (e.g., if (!/<[A-Za-z]+>/) { do_something() }). If you need it in the regex itself, and if your language supports such things, you can use a negative look-ahead assertion: ^(?!.*<[A-Za-z]+>). This says "match at the beginning of the string (^) if I can't find ((?!...)) the given text anywhere after it". Note that the match itself will contain no characters.
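Both variants can be checked against the examples from the question. The following is a minimal sketch in Python, assuming its re module as the regex engine; the names is_allowed and is_allowed_lookahead are hypothetical. The DOTALL flag is an addition not in the answer: without it, . does not match newlines, so the look-ahead would only inspect the first line of the input.

```python
import re

# Matches a tag-like sequence: "<", one or more ASCII letters, ">".
TAG_LIKE = re.compile(r"<[A-Za-z]+>")

# Negative look-ahead variant: succeeds (with an empty match) only when
# no tag-like sequence occurs anywhere after the start of the string.
# re.DOTALL lets "." cross newlines (an assumption added for multi-line input).
NO_TAG = re.compile(r"^(?!.*<[A-Za-z]+>)", re.DOTALL)


def is_allowed(text: str) -> bool:
    """Negation done in the language: allowed if no tag-like text is found."""
    return TAG_LIKE.search(text) is None


def is_allowed_lookahead(text: str) -> bool:
    """Negation done in the regex via the look-ahead assertion."""
    return NO_TAG.match(text) is not None


if __name__ == "__main__":
    samples = ["hello world", "<<<<<<<", "<>", "><", "<087>", "<-->",
               "<html>", "<a>", "<foo>", "<bar>"]
    for s in samples:
        print(repr(s), is_allowed(s), is_allowed_lookahead(s))
```

The first form is usually easier to read and maintain; the look-ahead form is mainly useful when a framework only accepts a single "must match" pattern.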