
Improve JavaScript regex to match content inside of tags with or without closing tag, excluding self

Preface: I'm aware of the general consensus against using regex to parse HTML. Please avoid any recommendations in that regard.


Explanations.

I have the following regex

/<div class="panel-body">([^]*?)(<\/div>|$)/gi

It matches all content inside the div with class .panel-body, including the div tags themselves.

Full match:

<div class="panel-body">
   <a href="#">Link</a>
   Line 1
   Line 2
   Line 3
</div>

It also matches content with no closing div tag.

Full match:

<div class="panel-body">
   <a href="#">Link</a>
   Line 1
   Line 2
   Line 3
   Don't match after closing `div`...but match this and below in case closing `div` is removed.
   Line below 1
   Line below 2
   Line below 3
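
To make the behavior described above concrete, here is a minimal sketch that runs the regex from the question against a closed and an unclosed sample. The sample strings are shortened versions of the ones above, and the helper name firstMatch is made up for illustration.

```javascript
// The regex from the question, unchanged.
const panelRegex = /<div class="panel-body">([^]*?)(<\/div>|$)/gi;

const closed = '<div class="panel-body">\n   Line 1\n</div>\nafter';
const unclosed = '<div class="panel-body">\n   Line 1\n   Line below 1';

// Hypothetical helper: run the regex once from the start of the input.
function firstMatch(input) {
  panelRegex.lastIndex = 0; // the g flag makes exec() stateful, so reset it
  return panelRegex.exec(input);
}

const closedMatch = firstMatch(closed);
const unclosedMatch = firstMatch(unclosed);

console.log(JSON.stringify(closedMatch[0]));   // full match, tags included
console.log(JSON.stringify(closedMatch[1]));   // group 1, content only
console.log(JSON.stringify(unclosedMatch[0])); // full match runs to the end of the input
```

Group 1 already holds the content without the tags; the open question is how to get the same text as the full match.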

Question.

How could I improve my regex to do the following:

  1. Exclude the opening <div class="panel-body"> and the closing </div> (when there is a closing div tag) from the full match

  2. Do this directly in the full match, if possible, without using capture groups

regex101.com example


Edit 1:

The string doesn't start with <div class="panel-body">, it starts with

<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Webmin 1.851 on centos.centos (CentOS Linux 7.3.1611)</title>
</head>
<body>
<div>
<div>
<div class="panel-body">

* Note: The div is never closed until the page has fully loaded, because the output is progressive.
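
As a rough illustration of the progressive case, the sketch below feeds a growing buffer to the same pattern (without the g flag) and reads group 1 after each chunk. The chunks are invented for the example and are not real Webmin output, and the helper name panelContentSoFar is hypothetical.

```javascript
// Same pattern as in the question, minus the g flag so exec() stays stateless.
const panelBodyPattern = /<div class="panel-body">([^]*?)(<\/div>|$)/i;

// Hypothetical helper: returns what has arrived so far inside the
// panel-body div, or null if the opening tag has not arrived yet.
function panelContentSoFar(buffer) {
  const match = panelBodyPattern.exec(buffer);
  return match === null ? null : match[1];
}

// Simulated progressive output (made-up chunks).
const chunks = [
  '<!DOCTYPE html>\n<html>\n<body>\n<div>\n<div>\n',
  '<div class="panel-body">',
  '\nLine 1',
  '\nLine 2\n</div>\n</div>',
];

let buffer = '';
const snapshots = [];
for (const chunk of chunks) {
  buffer += chunk;
  snapshots.push(panelContentSoFar(buffer));
}
console.log(snapshots);
```

Because the quantifier is lazy, the match stops at the first </div> it meets, so a nested div inside the panel would cut the content short.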

Edit 2:

After the answers were posted, I ran speed comparison tests. It's up to you to decide which solution would serve you best.

Speed-test Results
Ilia asked Jan 29 '23 21:01


2 Answers

You can use a DOM parser, which should work with incomplete tags as well:

function divContent(str) {
  // create a new div container
  var div = document.createElement('div');

  // assign your HTML to div's innerHTML
  div.innerHTML = '<html>' + str + '</html>';

  // find an element by given className
  var el = div.getElementsByClassName("panel-body");
  
  // return the innerHTML of the last matching element
  return (el.length > 0 ? el[el.length-1].innerHTML : "");
}

// extract text from a complete tag:
var html = `<div class="panel-body">
   <a href="#">Link</a>
   Line 1
   Line 2
   Line 3
</div>`;
console.log(divContent(html));

// extract text from an incomplete tag:
html = `<div class="panel-body">
   <a href="#">Link</a>
   Line 1
   Line 2
   Line 3
   Don't match after closing 'div'...but match this and below
   in case closing 'div' is removed.
   Line below 1
   Line below 2
   Line below 3`;   
console.log(divContent(html));

// OP's edited HTML text
html = `<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Webmin 1.851 on centos.centos (CentOS Linux 7.3.1611)</title>
</head>
<body>
<div>
<div>
<div class="panel-body">`;
console.log(divContent(html));

JS Fiddle

anubhava answered Feb 03 '23 08:02


I can't comment yet, so I will try an answer. How about non-capturing groups? The tags are still part of the full match, but the content would be the only captured group, so it is the first entry in the list of groups.

(?:<div class="panel-body">)([^]*?)(?:<\/div>|$)

https://regex101.com/r/OJf1Rt/3
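
A short sketch of how the two approaches compare in JavaScript. The first pattern is the one above; the second is a lookbehind/lookahead variant, which is an alternative and not part of this answer. It assumes an engine with ES2018 lookbehind support (current Node.js and Chromium have it; older browsers may not). The sample string is made up.

```javascript
const sample = '<div>\n<div class="panel-body">\n   Line 1\n</div>\n</div>';

// Non-capturing groups: the tags are still part of match[0],
// and the content is in match[1].
const withGroups =
  /(?:<div class="panel-body">)([^]*?)(?:<\/div>|$)/i.exec(sample);

// Lookaround variant (assumes ES2018 lookbehind support): the tags are
// asserted but not consumed, so match[0] is the content itself.
const withLookarounds =
  /(?<=<div class="panel-body">)[^]*?(?=<\/div>|$)/i.exec(sample);

console.log(JSON.stringify(withGroups[0]));      // includes both tags
console.log(JSON.stringify(withGroups[1]));      // content only
console.log(JSON.stringify(withLookarounds[0])); // content only, as the full match
```

With the lookaround variant the content is the full match even when the closing tag is missing, since the lookahead also accepts the end of the input.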

Jordan Maduro answered Feb 03 '23 06:02