
Stripping out HTML tags in a string

I'm writing a program that is supposed to strip HTML tags out of a string. I've been trying to replace every substring that starts with "<" and ends with ">". This (obviously, since I'm here asking) has not worked so far. Here's what I've tried:

StrippedContent = Regex.Replace(StrippedContent, "\<.*\>", "")

That just returns what seems like a random part of the original string. I've also tried:

For Each StringMatch As Match In Regex.Matches(StrippedContent, "\<.*\>")
    StrippedContent = StrippedContent.Replace(StringMatch.Value, "")
Next

That did the same thing (it returned what seems like a random part of the original string). Is there a better way to do this? By better I mean a way that works.

y-- asked Jul 15 '13 23:07


1 Answer

Description

This expression will:

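  • find every tag and replace it with nothing
  • skip over ">" characters that appear inside single- or double-quoted attribute values, so a match does not end before the tag really closes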

Regex: <(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>

Replace with: an empty string
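
To see why the greedy "\<.*\>" attempt from the question removes too much, the sketch below runs it next to the pattern above on a short string. The module name PatternComparison, the helper Strip, and the input string are illustrative assumptions and do not come from the original posts.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module PatternComparison
  ' The two patterns come from the thread; every other name here is hypothetical.
  Public Const GreedyPattern As String = "\<.*\>"
  Public Const TagPattern As String = "<(?:[^>=]|='[^']*'|=""[^""]*""|=[^'""][^\s>]*)*>"

  ' Removes every match of the given pattern from the source string.
  Public Function Strip(source As String, pattern As String) As String
    Return Regex.Replace(source, pattern, "")
  End Function

  Sub Main()
    Dim source As String = "one <b>two</b> three <i>four</i> five"

    ' Greedy: .* runs from the first "<" to the last ">" on the line,
    ' so the text between the tags is deleted along with the tags.
    Console.WriteLine(Strip(source, GreedyPattern))

    ' Tag-aware: each match ends at the ">" that closes its own tag,
    ' so only the tags are removed.
    Console.WriteLine(Strip(source, TagPattern))
  End Sub
End Module
```

With the greedy pattern only the text before the first tag and after the last tag survives, which is probably why the result looked like a random part of the original string.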


Example

Sample Text

Note the difficult edge case in the onmouseover attribute: its quoted value contains a ">" and text that looks like a second href.

these are <a onmouseover=' href="NotYourHref" ; if (6/a>3) { funRotator(href) } ; ' href=abc.aspx?filter=3&prefix=&num=11&suffix=>the droids</a> you are looking for.

Code

Imports System.Text.RegularExpressions

Module Module1
  Sub Main()
    Dim sourceString As String = "replace with your source string"
    Dim replacementString As String = ""
    ' Matches one tag; quoted attribute values are consumed whole, so a ">" inside them does not end the match.
    Dim matchPattern As String = "<(?:[^>=]|='[^']*'|=""[^""]*""|=[^'""][^\s>]*)*>"
    Dim options As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.IgnorePatternWhitespace Or RegexOptions.Multiline Or RegexOptions.Singleline
    ' Print the source string with every tag removed.
    Console.WriteLine(Regex.Replace(sourceString, matchPattern, replacementString, options))
  End Sub
End Module

String after replacement

these are the droids you are looking for.
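
As a quick check of the result above, the following sketch wraps the pattern in a small function and runs it on the sample text. TagStripper and StripTags are hypothetical names chosen for illustration. The RegexOptions flags from the code above are left out on the assumption that they do not change this pattern's behaviour, since it contains no ".", no "^" or "$" anchors, no literal letters, and no literal whitespace.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module TagStripper
  ' Pattern from the answer; quoted attribute values are consumed whole.
  Private ReadOnly TagRegex As New Regex("<(?:[^>=]|='[^']*'|=""[^""]*""|=[^'""][^\s>]*)*>")

  ' Hypothetical helper: returns the input with every tag match removed.
  Public Function StripTags(html As String) As String
    Return TagRegex.Replace(html, "")
  End Function

  Sub Main()
    ' Sample text from the answer, with inner double quotes doubled for VB.
    Dim sample As String = "these are <a onmouseover=' href=""NotYourHref"" ; if (6/a>3) { funRotator(href) } ; ' href=abc.aspx?filter=3&prefix=&num=11&suffix=>the droids</a> you are looking for."

    ' Expected: these are the droids you are looking for.
    Console.WriteLine(StripTags(sample))
  End Sub
End Module
```

This removes the tags only; the text between an opening and closing tag, including the body of a script element, is left in place.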
Ro Yo Mi answered Oct 12 '22 06:10