How can I remove all the HTML tags including   using regex in C#. My string looks like
"<div>hello</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div> </div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div>"
To strip out all the HTML tags from a string there are lots of procedures in JavaScript. In order to strip out tags we can use replace() function and can also use . textContent property, . innerText property from HTML DOM.
Approach: Select the HTML element which need to remove. Use JavaScript remove() and removeChild() method to remove the element from the HTML document.
Definition and Usage. The strip_tags() function strips a string from HTML, XML, and PHP tags. Note: HTML comments are always stripped.
If you can't use an HTML parser oriented solution to filter out the tags, here's a simple regex for it.
string noHTML = Regex.Replace(inputHTML, @"<[^>]+>| ", "").Trim();
You should ideally make another pass through a regex filter that takes care of multiple spaces as
string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " ");
I took @Ravi Thapliyal's code and made a method: It is simple and might not clean everything, but so far it is doing what I need it to do.
public static string ScrubHtml(string value) { var step1 = Regex.Replace(value, @"<[^>]+>| ", "").Trim(); var step2 = Regex.Replace(step1, @"\s{2,}", " "); return step2; }
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With