RegEx to extract text between a HTML tag

Tags: java, regex

I'm looking for a regular expression that extracts the text between the opening and closing tags of different kinds of HTML elements.

For example:

<span>Span 1</span> - Output: Span 1

<div onclick="callMe()">Span 2</div> - Output: Span 2

<a href="#">HyperText</a> - Output: HyperText

I found this pattern, <([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>, but it is not working.

Sriram asked Jan 14 '23 19:01


2 Answers

Your comment shows that you have neglected to escape the backslashes in your regex string. In a Java string literal every backslash must be doubled, so \b becomes \\b and \1 becomes \\1.

To match lowercase tag names as well, add a-z to the character classes, or use Pattern.CASE_INSENSITIVE (equivalently, add (?i) to the beginning of the regex):

"<([A-Za-z][A-Za-z0-9]*)\\b[^>]*>(.*?)</\\1>"

If the tag contents may contain newlines, then use Pattern.DOTALL or add (?s) to the beginning of the regex to turn on dotall/singleline mode.
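
As a rough sketch of how this advice might be put together (the class name TagTextExtractor and the method extract are made up for illustration, and both flags are enabled at once even though only one may be needed), the escaped pattern can be compiled like this:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption, not from the question.
class TagTextExtractor {
    // Backslashes are doubled because this is a Java string literal.
    // CASE_INSENSITIVE and DOTALL are the two flags mentioned above.
    private static final Pattern TAG = Pattern.compile(
            "<([A-Za-z][A-Za-z0-9]*)\\b[^>]*>(.*?)</\\1>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the text found between each matching pair of tags.
    static List<String> extract(String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            texts.add(m.group(2)); // group 1 is the tag name, group 2 the content
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<span>Span 1</span>\n"
                + "<div onclick=\"callMe()\">Span 2</div>\n"
                + "<a href=\"#\">HyperText</a>";
        System.out.println(extract(html)); // prints [Span 1, Span 2, HyperText]
    }
}
```

Note that this only handles simple, non-nested markup like the examples in the question; nested elements with the same tag name would need a real HTML parser.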

MikeM answered Jan 31 '23 08:01


This should suit your needs:

<([a-zA-Z]+).*?>(.*?)</\\1>

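The first group contains the tag name, and the second contains the text in between. The doubled backslash in \\1 is how the backreference \1 is written inside a Java string literal.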
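
To illustrate how the two groups might be read back in Java, here is a minimal sketch (the class name TagGroupsDemo and the method firstMatch are hypothetical and only serve the example):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are assumptions for illustration only.
class TagGroupsDemo {
    // Same pattern as above, written as a Java string literal.
    private static final Pattern TAG = Pattern.compile("<([a-zA-Z]+).*?>(.*?)</\\1>");

    // Returns {tagName, innerText} for the first match, or null if nothing matches.
    static String[] firstMatch(String html) {
        Matcher m = TAG.matcher(html);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] result = firstMatch("<a href=\"#\">HyperText</a>");
        System.out.println(result[0] + " -> " + result[1]); // prints a -> HyperText
    }
}
```

Without Pattern.DOTALL the dots here do not match newlines, so content spanning several lines would not be captured by this version.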

sp00m answered Jan 31 '23 09:01