
How to convert HTML to valid XHTML?

I have a string of HTML; in this example it looks like this:

<img src="somepic.jpg" someAtrib="1" >

I am trying to work out a regex that will match the 'img' node and add a slash to the end of the tag, so that it looks like this:

<img src="somepic.jpg" someAtrib="1" />

Essentially, the end goal is to ensure that the node is closed; unclosed nodes are valid in HTML but not in XML. Are there any regex buffs out there who can help?

John asked Aug 23 '12 13:08



2 Answers

You can create an XHTML document and import or adopt HTML elements into it. The HTML string can be parsed with the HTMLElement.innerHTML property, of course. The key point is to use Document.importNode() or Document.adoptNode() to convert the HTML nodes to XHTML nodes:

var di = document.implementation;
// Source HTML document: its parser accepts unclosed tags such as <img>.
var hd = di.createHTMLDocument();
// Target XML document whose root <html> element is in the XHTML namespace.
var xd = di.createDocument('http://www.w3.org/1999/xhtml', 'html', null);
hd.body.innerHTML = '<img>';
var img = hd.body.firstElementChild;
var xb = xd.createElement('body');
xd.documentElement.appendChild(xb);
console.log('html doc:\n' + hd.documentElement.outerHTML + '\n');
console.log('xhtml doc:\n' + xd.documentElement.outerHTML + '\n');
// Deep-copy the node into the XHTML document (xd.adoptNode(img) would move it instead).
img = xd.importNode(img, true); // now img is an XHTML element
xb.appendChild(img);
console.log('xhtml doc after import/adopt img from html:\n' + xd.documentElement.outerHTML + '\n');

The output should be:

html doc:
<html><head></head><body><img></body></html>

xhtml doc:
<html xmlns="http://www.w3.org/1999/xhtml"><body></body></html>

xhtml doc after import/adopt img from html:
<html xmlns="http://www.w3.org/1999/xhtml"><body><img /></body></html>

Rob W's answer does not work in Chrome (at least version 29 and below), because DOMParser there does not support the 'text/html' type, and XMLSerializer generates HTML syntax (not XHTML) for an HTML document.

Duan Yao answered Oct 05 '22 22:10



Don't use a regular expression; use a dedicated parser instead. In JavaScript, create a document using DOMParser, then serialize it using XMLSerializer:

var doc = new DOMParser().parseFromString('<img src="foo">', 'text/html');
var result = new XMLSerializer().serializeToString(doc);
// result:
// <html xmlns="http://www.w3.org/1999/xhtml"><head></head><body> (no line break)
// <img src="foo" /></body></html>
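The snippet above serializes the whole document, including the html, head and body wrappers. If only the fragment is wanted, one possible sketch is to serialize the children of body one by one. The function name htmlFragmentToXhtml is made up for this illustration, and the code assumes a browser where DOMParser accepts 'text/html':

```javascript
// Hypothetical helper (not part of any library): converts an HTML fragment
// to XML syntax by serializing each child of <body> separately.
function htmlFragmentToXhtml(html) {
  var doc = new DOMParser().parseFromString(html, 'text/html');
  var serializer = new XMLSerializer();
  var out = '';
  // Walk the top-level nodes the HTML parser placed inside <body>.
  for (var node = doc.body.firstChild; node; node = node.nextSibling) {
    out += serializer.serializeToString(node);
  }
  return out;
}

// Only runs where the DOM APIs exist (for example, a browser console).
if (typeof DOMParser !== 'undefined' && typeof XMLSerializer !== 'undefined') {
  console.log(htmlFragmentToXhtml('<img src="foo">'));
}
```

Because each node is serialized on its own, expect every top-level element to carry its own xmlns="http://www.w3.org/1999/xhtml" attribute in the result.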

If you need to do this in a Node.js backend, which has no built-in DOMParser, use the xmldom package (npm i xmldom).
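To illustrate why a regular expression is fragile here, below is a sketch of a naive pattern. The function naiveCloseImg and its regex are hypothetical, written only for this illustration; they are not taken from the question. It handles the example from the question, but breaks as soon as an attribute value contains a '>' character, which is legal in HTML:

```javascript
// Hypothetical naive approach: rewrite the end of every <img ...> tag as " />".
function naiveCloseImg(html) {
  return html.replace(/<img\b([^>]*?)\s*\/?>/gi, '<img$1 />');
}

// Works for the simple case from the question:
console.log(naiveCloseImg('<img src="somepic.jpg" someAtrib="1" >'));
// prints: <img src="somepic.jpg" someAtrib="1" />

// Breaks when an attribute value contains ">": the tag is cut inside the alt text.
console.log(naiveCloseImg('<img alt="a > b" src="x.jpg">'));
// prints: <img alt="a /> b" src="x.jpg">
```

A parser knows where an attribute value starts and ends, so it does not have this problem.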

Rob W answered Oct 05 '22 22:10
