
Remove HTML tags from a string, including &nbsp;, in C#

How can I remove all HTML tags, including &nbsp;, from a string using a regex in C#? My string looks like this:

  "<div>hello</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div>" 
rampuriyaaa asked Oct 22 '13 16:10



2 Answers

If you can't use an HTML-parser-based solution to filter out the tags, here's a simple regex for it.

string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim(); 

Ideally, make a second pass with a regex that collapses runs of multiple whitespace characters into a single space:

string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " "); 
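
One thing to watch with the two passes above: because each tag is replaced with an empty string, text from adjacent elements is joined together ("<div>a</div><div>b</div>" becomes "ab"). A minimal sketch of a variant that replaces each match with a space instead and then collapses and trims; the `StripTagsDemo` class and `StripTags` method names are hypothetical, chosen only for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripTagsDemo
{
    // Hypothetical helper, not part of the original answer.
    public static string StripTags(string inputHTML)
    {
        // Pass 1: swap each tag or &nbsp; for a space (an assumption of this
        // sketch) so text from neighbouring elements is not glued together.
        string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", " ");
        // Pass 2: collapse every whitespace run to one space, then trim the ends.
        return Regex.Replace(noHTML, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        string inputHTML = "<div>hello</div><div><br></div><div>&nbsp; &nbsp;world</div>";
        Console.WriteLine(StripTags(inputHTML)); // prints "hello world"
    }
}
```

Whether a space or an empty string is the better replacement depends on the input: a space keeps block elements apart, but it also splits words wrapped in inline tags such as "<b>".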
Ravi K Thapliyal answered Sep 29 '22 04:09


I took @Ravi Thapliyal's code and turned it into a method. It is simple and might not clean everything, but so far it does what I need it to do.

public static string ScrubHtml(string value)
{
    var step1 = Regex.Replace(value, @"<[^>]+>|&nbsp;", "").Trim();
    var step2 = Regex.Replace(step1, @"\s{2,}", " ");
    return step2;
}
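
One case such a method does not clean is any entity other than &nbsp;, for example &amp;, which is left in the output as-is. A possible extension, offered as a sketch rather than as the answer's method, is to strip the tags and then decode entities with `System.Net.WebUtility.HtmlDecode`. The `HtmlScrubber` class and `ScrubHtmlDecoded` method names are hypothetical:

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlScrubber
{
    // Hypothetical variant of ScrubHtml that also decodes HTML entities.
    public static string ScrubHtmlDecoded(string value)
    {
        // Strip tags first (same tag pattern as ScrubHtml, without the &nbsp; branch).
        var noTags = Regex.Replace(value, @"<[^>]+>", "");
        // Decode entities: &nbsp; becomes U+00A0, &amp; becomes '&', and so on.
        var decoded = WebUtility.HtmlDecode(noTags);
        // In .NET, \s also matches U+00A0, so decoded non-breaking spaces
        // are collapsed together with ordinary whitespace here.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        Console.WriteLine(ScrubHtmlDecoded("<div>Tom &amp; Jerry&nbsp;&nbsp;</div>")); // prints "Tom & Jerry"
    }
}
```

Decoding after the tags are removed means an escaped sequence such as &lt;b&gt; survives as the literal text "<b>" instead of being stripped as a tag.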
Don Rolling answered Sep 29 '22 03:09