
RegExp. Get only text content of tag (without inner tags)

I have a string containing HTML code:

<h2 class="some-class"> 
   <a href="#link" class="link" id="first-link"
      <span class="bold">link</span>
   </a>
   NEED TO GET THIS
</h2>

I need to get only the text content of the h2. I wrote this regular expression:

(?<=>)(.*)(?=<\/h2>)

But it only works if the h2 has no inner tags. Otherwise I get this:

   <a href="#link" class="link" id="first-link"
      <span class="bold">link</span>
   </a>
   NEED TO GET THIS
andreyb1990 asked Mar 04 '17 16:03


2 Answers

Don't use regex to parse HTML; see these well-known answers:

Using regular expressions to parse HTML: why not?

RegEx match open tags except XHTML self-contained tags


Instead, create a temporary element, set the string as its HTML, and collect the content of the h2's direct child text nodes, skipping the element nodes.

var str = `<h2 class="some-class"> 
   <a href="#link" class="link" id="first-link"
      <span class="bold">link</span>
   </a>
   NEED TO GET THIS
</h2>`;

// generate a temporary DOM element
var temp = document.createElement('div');
// set content
temp.innerHTML = str;
// get the h2 element
var h2 = temp.querySelector('h2');

console.log(
  // get all child nodes and convert them into an array
  // for older browsers use [].slice.call(h2...)
  Array.from(h2.childNodes)
  // iterate over the child nodes
  .map(function(e) {
    // if it is a text node (nodeType 3), return its trimmed content;
    // otherwise return an empty string
    return e.nodeType === 3 ? e.textContent.trim() : '';
  })
  // join the array of strings
  .join('')
  // you can use the reduce method instead of map and join
  // .reduce(function(s, e) { return s + (e.nodeType === 3 ? e.textContent.trim() : ''); }, '')
);
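
The same idea can be wrapped in a small reusable function. This is only a sketch: the name ownText and the TEXT_NODE constant are made up for illustration, and unlike the snippet above it joins the pieces with a single space, so that text on both sides of an inner tag does not run together.

```javascript
// nodeType value of a text node (Node.TEXT_NODE in browsers).
const TEXT_NODE = 3;

// Hypothetical helper (not part of any library): returns the text held
// directly by `element`, ignoring text inside its child elements.
function ownText(element) {
  return Array.from(element.childNodes)
    // keep only the direct text-node children
    .filter((node) => node.nodeType === TEXT_NODE)
    // strip surrounding whitespace and line breaks
    .map((node) => node.textContent.trim())
    // drop nodes that contained only whitespace
    .filter((text) => text.length > 0)
    // assumption: one space between separate text runs
    .join(' ');
}

// Browser usage, with the h2 variable from the snippet above:
//   console.log(ownText(h2));
```

The function only reads childNodes, nodeType and textContent, so it should work on any element, not just the h2.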

Reference:

Fastest way to convert JavaScript NodeList to Array?

Pranav C Balan answered Oct 04 '22 20:10


Regex is not a good tool for parsing HTML, but if your HTML is not valid, or you want to use regex anyway:

(?!>)([^><]+)(?=<\/h2>)


  • It matches the last run of text directly before the closing </h2> tag (if there is one).

  • To avoid empty matches, * was changed to +.

  • This regex is very limited and only fits narrow cases like the one in the question.
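
A rough sketch of using this pattern from JavaScript, assuming the markup is held in a string. The helper name lastTextBeforeClosingH2 is made up for illustration, and the sample markup is copied from the question as posted.

```javascript
// Sample markup from the question, kept exactly as posted.
const html = `<h2 class="some-class">
   <a href="#link" class="link" id="first-link"
      <span class="bold">link</span>
   </a>
   NEED TO GET THIS
</h2>`;

// Hypothetical helper: applies the pattern from this answer and trims the result.
function lastTextBeforeClosingH2(source) {
  // [^><]+ also matches line breaks, so no extra flags are needed
  const match = source.match(/(?!>)([^><]+)(?=<\/h2>)/);
  // match is null when a tag sits directly before </h2>
  return match ? match[1].trim() : null;
}

const result = lastTextBeforeClosingH2(html);
console.log(result); // NEED TO GET THIS
```

Only the text that directly precedes </h2> is returned. Text that appears before an inner tag is dropped, and the result is null when the h2 ends with an inner tag.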

MohaMad answered Oct 04 '22 22:10