
Split a string of HTML into an array by particular tags

Given this HTML as a string "html", how can I split it into an array where each header <h marks the start of an element?

Begin with this:

<h1>A</h1>
<h2>B</h2>
<p>Foobar</p>
<h3>C</h3>

Result:

["<h1>A</h1>", "<h2>B</h2><p>Foobar</p>", "<h3>C</h3>"]

What I've tried:

I wanted to use String.split() with a regex, but the result puts each <h into its own array element. I need to capture from the start of one <h up to the next <h, including the first one but excluding the second.

var html = '<h1>A</h1><h2>B</h2><p>Foobar</p><h3>C</h3>';
var foo = html.split(/(<h)/);

Edit: Regex is not a requirement in any way; it's just the only solution I thought would work for splitting HTML strings like this in general.

Don P asked Dec 28 '15 10:12





2 Answers

In your example you can use:

/
  <h   // Match literal <h
  (.)  // Match any single character and capture it in group 1
  >    // Match literal >
  .*?  // Match any character zero or more times, non-greedy
  <\/h // Match literal </h
  \1   // Match whatever group 1 captured
  >    // Match literal >
/g
var str = '<h1>A</h1><h2>B</h2><p>Foobar</p><h3>C</h3>'
str.match(/<h(.)>.*?<\/h\1>/g); // ["<h1>A</h1>", "<h2>B</h2>", "<h3>C</h3>"]
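
Note that the match above returns only the header elements, so <p>Foobar</p> is dropped. If the content after each header should stay attached to it, as in the question's expected result, one possible sketch is to split on a zero-width lookahead instead. The helper name splitAtHeaders is hypothetical, and the sketch assumes flat markup where "<h1" to "<h6" only ever appears as a real header tag (not inside attributes, comments, or scripts).

```javascript
// Hypothetical helper: split an HTML string so that each chunk starts at a header tag.
// Assumes flat markup; this is a sketch, not a general HTML parser.
function splitAtHeaders(html) {
  // (?=...) is a zero-width lookahead, so split() cuts just before each
  // <h1>..<h6> without consuming the tag itself. [\s>] requires the tag name
  // to end there, which keeps <hr> and <header> from matching.
  return html
    .split(/(?=<h[1-6][\s>])/i)
    // Drop chunks that are empty or whitespace-only.
    .filter(function (chunk) { return chunk.trim() !== ''; });
}

var parts = splitAtHeaders('<h1>A</h1><h2>B</h2><p>Foobar</p><h3>C</h3>');
console.log(parts); // ["<h1>A</h1>", "<h2>B</h2><p>Foobar</p>", "<h3>C</h3>"]
```

Any content that comes before the first header ends up as its own first chunk, and whitespace between tags is kept inside the chunks.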

But please don't parse HTML with regular expressions; read the question "RegEx match open tags except XHTML self-contained tags".

Andreas Louv answered Sep 17 '22 11:09



I'm sure someone could simplify the for loop that puts the angle brackets back in, but this is how I'd do it.

var html = '<h1>A</h1><h2>B</h2><p>Foobar</p><h3>C</h3>';

//split on ><
var arr = html.split(/></g);

//split removes the >< so we need to determine where to put them back in.
for(var i = 0; i < arr.length; i++){
  if(arr[i].substring(0, 1) != '<'){
    arr[i] = '<' + arr[i];
  }

  if(arr[i].slice(-1) != '>'){
    arr[i] = arr[i] + '>';
  }
}

Alternatively, we could strip the first and last angle brackets, do the split, and then add the brackets back to every element.

var html = '<h1>A</h1><h2>B</h2><p>Foobar</p><h3>C</h3>';

//remove first and last characters
html = html.substring(1, html.length-1);

//do the split on ><
var arr = html.split(/></g);

//add the brackets back in
for(var i = 0; i < arr.length; i++){
    arr[i] = '<' + arr[i] + '>';
}

Oh, of course this will fail with elements that have no content (for example <p></p>), because the >< inside them is treated as a split point too.
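
Both variants above produce one array entry per element, so getting the header-led groups from the question needs one more pass. Here is a minimal sketch of that pass, assuming the elements were already split correctly. The name groupByHeader is hypothetical and not part of either snippet.

```javascript
// Hypothetical helper: merge per-element strings into groups that each start at a header.
function groupByHeader(elements) {
  var groups = [];
  elements.forEach(function (el) {
    // An element opens a new group if it begins with <h1>..<h6>.
    var startsWithHeader = /^<h[1-6][\s>]/i.test(el);
    if (startsWithHeader || groups.length === 0) {
      groups.push(el);
    } else {
      // Otherwise append it to the most recent group.
      groups[groups.length - 1] += el;
    }
  });
  return groups;
}

var grouped = groupByHeader(['<h1>A</h1>', '<h2>B</h2>', '<p>Foobar</p>', '<h3>C</h3>']);
console.log(grouped); // ["<h1>A</h1>", "<h2>B</h2><p>Foobar</p>", "<h3>C</h3>"]
```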

Donnie D'Amato answered Sep 18 '22 11:09
