I am using JavaScript and want to traverse the HTML tree, collecting all the text as it appears to the user. However, I am losing spacing information.
Let's say I have two docs:
<html>XXX<p>YY   YY</p></html>
<html>XXX<p>YY&nbsp;&nbsp;&nbsp;YY</p></html>
The first one will appear with 1 space between the Ys. The second will have 3 spaces. However, if I traverse the tree and, for each #text node, use:
text = node.nodeValue;
then the text for both nodes will have 3 spaces. I no longer know which one has the "real" &nbsp; spaces. I can use node.innerHTML for the p elements, which will show the &nbsp; entities, but I don't think I can use innerHTML to get just the XXX text (without some kind of text subtraction).
I could just get innerHTML of the whole document and parse that. However, I also need to get the computed style of each element, which I am going to get using
window.getComputedStyle(theElement).getPropertyValue("text-align");
So, I will be traversing each node. Also, innerHTML shows the source as is, while traversing the nodes "fixes" html errors, adding end tags, etc. That's a good thing and something I'd like to keep.
What if you test by charCode? I believe a regular space is 32, while &nbsp; is 160.
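Here is a minimal sketch of that idea, not a definitive implementation. The helper names (countNbsp, collapseWhitespace, collectText) are made up for illustration, and the collapsing step assumes the text is rendered with the default white-space: normal, so it would be wrong inside pre or similar. Note that the regex lists the collapsible characters explicitly, because \s in JavaScript also matches U+00A0.

```javascript
// Char code of a non-breaking space (U+00A0); a regular space is 32.
const NBSP = 160;

// Hypothetical helper: count the non-breaking spaces in a string.
function countNbsp(text) {
  let count = 0;
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) === NBSP) count++;
  }
  return count;
}

// Hypothetical helper: approximate what the user sees by collapsing runs of
// ordinary whitespace to one space. Non-breaking spaces are left untouched.
// Assumes white-space: normal; does not trim at block boundaries.
function collapseWhitespace(text) {
  return text.replace(/[ \t\n\r\f]+/g, ' ');
}

// Hypothetical helper: walk a node tree and collect the collapsed text of
// every #text node (nodeType 3), in document order.
function collectText(node, out = []) {
  if (node.nodeType === 3) {
    out.push(collapseWhitespace(node.nodeValue));
  } else {
    for (const child of node.childNodes || []) {
      collectText(child, out);
    }
  }
  return out;
}
```

In a browser you could call collectText(document.documentElement) and, for each element you visit, still read window.getComputedStyle(theElement) as in the question. Since the walk uses the parsed DOM rather than innerHTML, you keep the "fixed" tree.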