
Getting unparsed (raw) HTML with JavaScript

I need to get the actual HTML code of an element in a web page.

For example, if the actual HTML code inside the element is "How to&nbsp;fix"

Running this JavaScript:

document.getElementById('myE').innerHTML

Gives me "How to fix" (with a literal non-breaking space character), which is the parsed HTML.

How can I get the unparsed "How to&nbsp;fix" using JavaScript?

Melina asked Oct 11 '10 10:10



2 Answers

You cannot get the actual HTML source of part of your web page.

When you give a web browser an HTML page, it parses the HTML into DOM nodes that are the definitive version of your document as far as the browser is concerned. The DOM keeps the significant information from the HTML (for example, that you used the Unicode character U+00A0 Non-Breaking Space before the word fix) but not the irrelevant information that you wrote it as an entity reference (&nbsp;) rather than typing the raw character.

When you ask the browser for an element node's innerHTML, it doesn't give you the original HTML source that was parsed to produce that node, because it no longer has that information. Instead, it generates new HTML from the data stored in the DOM. The browser decides on how to format that HTML serialisation; different browsers produce different HTML, and chances are it won't be the same way you formatted it originally.

In particular,

  • element names may be upper- or lower-cased;

  • attributes may not be in the same order as you stated them in the HTML;

  • attribute quoting may not be the same as in your source. IE often generates unquoted attributes that aren't even valid HTML; all you can be sure of is that the innerHTML generated will be safe to use in the same browser by writing it to another element's innerHTML;

  • it may not use entity references for anything but characters that would otherwise be impossible to include directly in text content: ampersands, less-thans and attribute-value-quotes. Instead of returning &nbsp; it may simply give you the raw U+00A0 character.

You may not be able to see that that's a non-breaking space, but it still is one and if you insert that HTML into another element it will act as one. You shouldn't need to rely anywhere on a non-breaking space character being entity-escaped to &nbsp;... if you do, for some reason, you can get that by doing:

var x = el.innerHTML.replace(/\xA0/g, '&nbsp;');

but that's only escaping U+00A0 and not any of the other thousands of possible Unicode characters, so it's a bit questionable.
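If an ASCII-only serialisation is really what you need, one option is to escape every non-ASCII character as a numeric character reference instead of special-casing U+00A0. This is only a sketch: the helper name escapeNonAscii is made up for illustration, and it still cannot recover which named entity (if any) the original source used, so a non-breaking space comes out as &#160; rather than &nbsp;.

```javascript
// Hypothetical helper: rewrite every non-ASCII character in an HTML string
// as a decimal numeric character reference such as &#160;.
function escapeNonAscii(html) {
    var out = '';
    // for...of iterates by code point, so characters outside the BMP stay whole
    for (var ch of html) {
        var cp = ch.codePointAt(0);
        out += cp > 0x7F ? '&#' + cp + ';' : ch;
    }
    return out;
}

console.log(escapeNonAscii('How to\u00A0fix')); // How to&#160;fix
```

In a page you would pass it el.innerHTML, as in the one-liner above. ASCII markup such as tags and existing &amp; references passes through unchanged.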

If you really really need to get your page's actual source HTML, you can make an XMLHttpRequest to your own URL (location.href) and get the full, unparsed HTML source in the responseText. There is almost never a good reason to do this.
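A minimal sketch of that request, assuming a same-origin GET is acceptable. The function name fetchPageSource and its optional XhrCtor parameter are illustrative and not part of any API; the parameter only exists so the function can be exercised outside a browser. Note that this asks the server again, so a dynamically generated page may not come back identical to the one that was originally loaded.

```javascript
// Sketch: re-request a URL and hand the raw, unparsed response text to a callback.
// XhrCtor is a hypothetical optional parameter; in a page it falls back to the
// browser's built-in XMLHttpRequest.
function fetchPageSource(url, onDone, XhrCtor) {
    var Ctor = XhrCtor || XMLHttpRequest;
    var xhr = new Ctor();
    xhr.open('GET', url, true);
    xhr.onreadystatechange = function () {
        if (xhr.readyState !== 4) return; // 4 means the request is done
        if (xhr.status === 200) {
            onDone(null, xhr.responseText);
        } else {
            onDone(new Error('HTTP status ' + xhr.status), null);
        }
    };
    xhr.send(null);
}

// Browser-only usage: log the source of the current page.
if (typeof XMLHttpRequest !== 'undefined' && typeof location !== 'undefined') {
    fetchPageSource(location.href, function (err, source) {
        if (err) { console.error(err); return; }
        console.log(source);
    });
}
```

The response is the whole document, so you would still have to locate the element's markup within it yourself.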

bobince answered Oct 18 '22 17:10


What you have should work:

Element test:

<div id="myE">How to&nbsp;fix</div>

JavaScript test:

alert(document.getElementById("myE").innerHTML); // alerts "How to&nbsp;fix"

Make sure that wherever you're displaying the result isn't rendering &nbsp; as a space, which is likely the case. If you want to show it somewhere that's designed for HTML, you'll need to escape it.
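As a rough illustration of that escaping step, the sketch below replaces the characters that HTML treats specially so the markup is shown literally instead of being interpreted. The function name escapeHtml is hypothetical, not a built-in.

```javascript
// Hypothetical helper: make a string safe to show as literal text inside HTML.
function escapeHtml(text) {
    return text
        .replace(/&/g, '&amp;')   // ampersand first, so the entities added below are not re-escaped
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/"/g, '&quot;');
}

console.log(escapeHtml('How to&nbsp;fix')); // How to&amp;nbsp;fix
```

Writing the escaped string into another element's innerHTML then displays the text How to&nbsp;fix literally. Assigning the unescaped string to textContent achieves the same without manual escaping.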

Nick Craver answered Oct 18 '22 19:10