
Regular expression to remove HTML tags from a string [duplicate]

Tags: html, regex

Possible Duplicate:
Regular expression to remove HTML tags

Is there an expression which will get the value between two HTML tags?

Given this:

<td class="played">0</td> 

I am looking for an expression which will return 0, stripping the <td> tags.

danny asked Jun 27 '12 15:06



1 Answer

You should not attempt to parse HTML with regex. HTML is not a regular language, so any regex you come up with will likely fail on some esoteric edge case. Please refer to the seminal answer to this question for specifics. While mostly formatted as a joke, it makes a very good point.


The following examples are in Java, but the regex will be similar -- if not identical -- in other languages.


String target = someString.replaceAll("<[^>]*>", ""); 

This assumes that your non-HTML text does not contain any < or > characters and that your input string is well-formed.
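As a minimal sketch of the general approach, the snippet below wraps that replaceAll call in a small helper and runs it on the cell from the question. The class and method names (TagStripper, stripTags) are made up for illustration. The second call shows why the assumption about stray > characters matters: a > inside an attribute value ends the match early and leaves debris behind.

```java
final class TagStripper {
    // Removes anything that looks like a tag: '<', any run of non-'>' characters, then '>'.
    // Hypothetical helper name; the regex is the one from the answer.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // The cell from the question: only the text "0" is left.
        System.out.println(stripTags("<td class=\"played\">0</td>"));

        // A '>' inside an attribute value breaks the assumption:
        // the match stops at the first '>', so part of the tag survives.
        System.out.println(stripTags("<a title=\"a > b\">x</a>"));
    }
}
```

If input like the second case is possible, a real HTML parser is the safer choice.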

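If you know you are dealing with a specific tag -- for example, you know the text contains only <td> tags -- you could do something like this (the /? makes the pattern match the closing </td> as well as the opening tag, and \b stops it from matching longer tag names that merely start with "td"):

String target = someString.replaceAll("(?i)</?td\\b[^>]*>", "");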
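Here is a small sketch of the tag-specific variant. The pattern is this sketch's own choice rather than anything canonical: (?i) makes it case-insensitive, /? lets it match closing tags too, and \b keeps it from matching other tag names that happen to begin with "td". TdStripper and stripTd are hypothetical names.

```java
final class TdStripper {
    // Removes opening and closing <td> tags only; every other tag is left alone.
    static String stripTd(String html) {
        return html.replaceAll("(?i)</?td\\b[^>]*>", "");
    }

    public static void main(String[] args) {
        // Mixed-case tags are handled because of (?i).
        System.out.println(stripTd("<TD class=\"played\">0</td>"));

        // Tags other than <td> survive, here the <b> pair.
        System.out.println(stripTd("<td><b>0</b></td>"));
    }
}
```

The same caveats apply as for the general pattern: it relies on the markup being well-formed and on > never appearing inside an attribute value.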

Edit: Ωmega brought up a good point in a comment on another post that this would result in multiple results all being squished together if there were multiple tags.

For example, if the input string were <td>Something</td><td>Another Thing</td>, then the above would result in SomethingAnother Thing.

In a situation where multiple tags are expected, we could do something like:

String target = someString.replaceAll("(?i)</?td\\b[^>]*>", " ").replaceAll("\\s+", " ").trim();

This replaces each <td> tag (opening or closing) with a single space, then collapses runs of whitespace into one space, and finally trims any whitespace left on the ends.
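To make the multi-cell case concrete, here is a sketch with two hypothetical helpers. joinedText applies the replace, collapse and trim steps described above. cells is an alternative that is not part of the steps above: it uses a capturing group to return each cell's content separately, which avoids having to split the text again later. Both assume flat, well-formed markup; nested tables would confuse the second pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CellText {
    // Matches an opening or closing <td> tag, case-insensitively.
    private static final Pattern TD_TAG = Pattern.compile("(?i)</?td\\b[^>]*>");

    // Captures the content between <td ...> and </td>; (?s) lets '.' span line breaks.
    private static final Pattern TD_CELL = Pattern.compile("(?is)<td\\b[^>]*>(.*?)</td\\s*>");

    // Replace each tag with a space, collapse whitespace, trim the ends.
    static String joinedText(String html) {
        String spaced = TD_TAG.matcher(html).replaceAll(" ");
        return spaced.replaceAll("\\s+", " ").trim();
    }

    // Return the trimmed content of every cell as a separate list entry.
    static List<String> cells(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = TD_CELL.matcher(html);
        while (m.find()) {
            out.add(m.group(1).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        String row = "<td>Something</td><td>Another Thing</td>";
        System.out.println(joinedText(row)); // Something Another Thing
        System.out.println(cells(row));      // [Something, Another Thing]
    }
}
```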

Roddy of the Frozen Peas answered Sep 21 '22 17:09