
How to get the static, original HTML source via JavaScript?

While developing a tool (whose details aren't important here, since the MCVEs below reproduce the issue), I noticed that, at least in the Chrome and Firefox versions I have on my desktop, the string I get from the innerHTML property is not equal to the original source code I wrote statically in the HTML file.

console.log(document.querySelector("div").innerHTML);
/*
  <table>
    <tbody><tr>
      <td>Hello</td>
      <td>World</td>
    </tr>
  </tbody></table>
*/

<div>
  <table>
    <tr>
      <td>Hello</td>
      <td>World</td>
    </tr>
  </table>
</div>

As you may have noticed, a spontaneous <tbody> tag (which I did not add to my HTML source!) appeared, apparently due to some preprocessing between the page download and the page's load event. In this particular case, for my application's purposes, this modification doesn't cause an error and can thus be ignored.

It turns out that, in certain cases, this sort of alteration can be catastrophic, especially when all the markup is removed, as in the example below.

console.log(document.querySelector("div").innerHTML);
/*
  Hello
  World
*/

<div>
  <td>Hello</td>
  <td>World</td>
</div>

Obviously, in this case the original markup has issues, but in my application, "misuses" (like a <td> inside a <div>) are accepted. What is not accepted is the innerHTML being left with no HTML markup at all, which leads to the main question: how can I get the original, statically coded HTML markup for the <div> element?

Also, if possible, it would be nice to know why and how this phenomenon occurs, because I'm curious :D

Rui Pimentel asked Nov 26 '14 19:11


2 Answers

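The browser downloads the HTML source and parses it into a DOM (Document Object Model). Any issues are fixed as well as possible, and elements whose tags can be omitted in the source (such as the <tbody> in your first example) may be added to the DOM. Conversely, tags that are not allowed where they appear, such as a <td> outside a table, are dropped, and only their text content is kept.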

From that moment on, this in-memory structure is used to render the page, and it is also the structure you access from JavaScript. So if you request the innerHTML of an element, you get HTML that is serialized from the DOM, not a piece of the original source. The original source itself is not directly available in JavaScript.

So that's why it happens, and there is not much you can do about it. I think the only workaround is to request the entire page again using AJAX, read the response as a string, and extract the required piece of source yourself.
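A rough sketch of that workaround, assuming the page can be re-requested at the same URL and returns the same markup. The names rawInnerSource and logOriginalDivSource are made up for illustration, and the extraction is plain string scanning rather than real HTML parsing, so it can be fooled by comments or scripts that contain tag-like text:

```javascript
// Returns the raw text between the first <tagName ...> and its matching
// </tagName> in `source`, or null if no complete element is found.
// Assumes `tagName` is a plain tag name such as "div" (no regex characters).
function rawInnerSource(source, tagName) {
  const tagPattern = new RegExp(`<(/?)${tagName}\\b[^>]*>`, "gi");
  let depth = 0;   // how many <tagName> elements are currently open
  let start = -1;  // index just past the first opening tag
  let match;
  while ((match = tagPattern.exec(source)) !== null) {
    const isClosing = match[1] === "/";
    if (!isClosing) {
      if (depth === 0) start = tagPattern.lastIndex;
      depth += 1;
    } else if (depth > 0) {
      depth -= 1;
      if (depth === 0) return source.slice(start, match.index);
    }
  }
  return null;
}

// Hypothetical browser-side usage: download the current page again as text
// and log the untouched source of its first <div>.
async function logOriginalDivSource() {
  const response = await fetch(window.location.href);
  const source = await response.text();
  console.log(rawInnerSource(source, "div"));
}
```

Because the string never goes through the HTML parser, a misplaced <td> inside the <div> should survive as written. Keep in mind that the second request costs an extra round trip, and a dynamically generated page might not return exactly what was served the first time.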

But a better solution, obviously, would be to remove those "misuses" and make your HTML source valid. If you just need to enclose some information in the page to be used by JavaScript alone, you might choose to embed a script tag that initializes a couple of variables with those values, rather than generating some invalid HTML.
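As a minimal sketch of that idea (the variable cells and the helpers escapeHtml and cellsToMarkup are hypothetical names, not something from the question), the values could live in a script as plain data, and the markup could be generated only when it is needed:

```javascript
// Hypothetical data that would otherwise be encoded as stray <td> elements.
const cells = ["Hello", "World"];

// Escapes the characters that are significant in HTML text and attributes.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Builds the <td> markup as a string, exactly as written here,
// without the HTML parser ever touching it.
function cellsToMarkup(values) {
  return values.map((value) => `<td>${escapeHtml(value)}</td>`).join("");
}

console.log(cellsToMarkup(cells)); // <td>Hello</td><td>World</td>
```

Since the values never pass through the HTML parser, nothing can be added or stripped before your script sees them.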

GolezTrol answered Oct 23 '22 05:10


I've tried to do something like this at work before. In some of my solutions I've built a complete table, with table rows around the table data elements that I wanted to use, just so I could use the <td> elements. If you're willing to do a little more processing on the JavaScript side, you could do something like this:

<div>
    <div class="td">Hello</div>
    <div class="td">World</div>
</div>

And then you could process this with JavaScript to turn the div.td elements into actual <td> elements. Just an idea.
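One possible way to do that conversion, as a sketch only: promoteDivCells is a made-up name, it puts all cells into a single row, and it copies only the text of each div.td, so nested markup inside a cell would be lost.

```javascript
// Replaces every <div class="td"> inside `container` with a real <td>,
// collected in one new <table><tbody><tr> appended to the container.
// `doc` defaults to the page's document; it is a parameter only so the
// function can be exercised outside a browser.
function promoteDivCells(container, doc = document) {
  const table = doc.createElement("table");
  const tbody = doc.createElement("tbody");
  const row = doc.createElement("tr");
  table.appendChild(tbody);
  tbody.appendChild(row);

  for (const div of Array.from(container.querySelectorAll("div.td"))) {
    const cell = doc.createElement("td");
    cell.textContent = div.textContent; // text only; nested markup is dropped
    row.appendChild(cell);
    div.remove();
  }

  container.appendChild(table);
  return table;
}

// Hypothetical browser usage:
//   promoteDivCells(document.querySelector("div"));
```

Building the table with createElement avoids the parser quirks from the question, because the elements are created directly in the DOM instead of being parsed from a string.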

The Real Diel answered Oct 23 '22 03:10