Why is "&#8203;" being injected into my HTML?

Question

EDIT: You can see the issue here (look in source).

EDIT2: Interesting, it is not an issue in source. Only with the console (Firebug as well).

I have the following markup in a file called test.html:

<!DOCTYPE html>
<html>
<head>
    <title>Test Harness</title>
    <link href='/css/main.css' rel='stylesheet' type='text/css' />
</head>
<body>
    <h3>Test Harness</h3>
</body>
</html>

But in Chrome, I see:

<!DOCTYPE html>
<html>
<head>
</head>
<body>
    "&#8203;           "
    <title>Test Harness</title>
    <link href='/css/main.css' rel='stylesheet' type='text/css' />
    <h3>Test Harness</h3>
</body>
</html>

It looks like &#8203; is a zero width space, but what is causing it? I am using Sublime Text 2 with UTF-8 encoding and Google App Engine with Jinja2 (but Jinja is simply loading test.html). Any thoughts?

Thanks in advance.

Jukka K. Korpela · Accepted Answer

It is an issue in the source. The live example that you provided starts with the following bytes (i.e., they appear before <!DOCTYPE html>): 0xE2 0x80 0x8B. This can be seen e.g. using Rex Swain’s HTTP Viewer by selecting “Hex” under “Display Format”. Also note that validating the page with the W3C Markup Validator gives information that suggests that there is something very wrong at the start of the document, especially the message “Line 1, Column 1: Non-space characters found without seeing a doctype first.”
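If you would rather check for those bytes yourself than use an online viewer, something like the following sketch could work. It assumes you already have the raw file or response bytes as a Uint8Array; `hexPrefix` and `startsWithZwsp` are hypothetical helper names, and the sample data stands in for the real page.

```javascript
// Hypothetical helpers for inspecting the first bytes of a document.

// Format the first `count` bytes as hex, e.g. "0xE2 0x80 0x8B".
function hexPrefix(bytes, count = 3) {
  return Array.from(bytes.slice(0, count), (b) =>
    "0x" + b.toString(16).toUpperCase().padStart(2, "0")
  ).join(" ");
}

// True when the data starts with the UTF-8 encoding of U+200B.
function startsWithZwsp(bytes) {
  return bytes.length >= 3 &&
    bytes[0] === 0xe2 && bytes[1] === 0x80 && bytes[2] === 0x8b;
}

// Sample data standing in for the real page: U+200B followed by the doctype.
const sampleBytes = new TextEncoder().encode("\u200B<!DOCTYPE html>");
console.log(hexPrefix(sampleBytes), startsWithZwsp(sampleBytes));
```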

What happens in the validator and in the Chrome tools – as well as e.g. in Firebug – is that the bytes 0xE2 0x80 0x8B are taken as character data, which implicitly starts the body element (since character data cannot validly appear in the head element or before it), implying an empty head element before it.

The solution, of course, is to remove those bytes. Browsers usually ignore them, but you should not rely on such error handling, and the bytes prevent useful HTML validation. How you remove them, and how they got there in the first place, depends on your authoring environment.
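As a possible stopgap while you track down where the bytes come from, a build or deploy step could strip them. This is only a sketch: `stripLeadingZwsp` is a made-up name, and it assumes the file has already been read and decoded from UTF-8 into a string.

```javascript
// Hypothetical cleanup step: drop any U+200B characters at the very start
// of a document, leaving the rest of the text untouched.
function stripLeadingZwsp(source) {
  return source.replace(/^\u200B+/, "");
}

const dirty = "\u200B<!DOCTYPE html>\n<html></html>";
const clean = stripLeadingZwsp(dirty);
console.log(clean.startsWith("<!DOCTYPE html>")); // true
```

Removing the character from the template file itself is still the real fix; a step like this only hides the symptom.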

Since the page is declared (in HTTP headers) as being UTF-8 encoded, those bytes represent the ZERO WIDTH SPACE (U+200B) character. It has no visible glyph and no width, so you won’t notice anything in the visual presentation even though browsers treat it as being data at the start of the body element. The notation &#8203; is a character reference for it, presumably used by browser tools to indicate the presence of a normally invisible character.
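The relationship between the three bytes, the code point, and the decimal reference shown in the console can be verified with a few lines. This is a sketch using the standard TextDecoder API; `toDecimalCharRef` is a hypothetical helper, not something the browser tools expose.

```javascript
// Decode the three bytes as UTF-8 and look at the resulting code point.
const zwspText = new TextDecoder("utf-8").decode(
  new Uint8Array([0xe2, 0x80, 0x8b])
);
const zwspCodePoint = zwspText.codePointAt(0);

// Hypothetical helper: build a decimal numeric character reference.
function toDecimalCharRef(codePoint) {
  return "&#" + codePoint + ";";
}

console.log(zwspText.length);                                 // 1
console.log("U+" + zwspCodePoint.toString(16).toUpperCase()); // U+200B
console.log(toDecimalCharRef(zwspCodePoint));                 // &#8203;
```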

It is possible that the software that produced the HTML document was meant to insert ZERO WIDTH NO-BREAK SPACE (U+FEFF) instead. That would have been valid, since by a special convention, UTF-8 encoded data may start with this character, also known as byte order mark (BOM) when appearing at the start of data. Using U+200B instead of U+FEFF sounds like an error that software is unlikely to make, but human beings may be mistaken that way if they think of the Unicode names of the characters.
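To compare the two characters at the byte level, here is a small sketch using the standard TextEncoder and TextDecoder APIs; `toHex` and `decodeUtf8` are hypothetical helper names. It also shows one practical difference: a UTF-8 TextDecoder drops a leading U+FEFF by default, while a leading U+200B is passed through as ordinary text.

```javascript
// Hypothetical helper: space-separated uppercase hex for a byte array.
function toHex(bytes) {
  return Array.from(bytes, (b) =>
    b.toString(16).toUpperCase().padStart(2, "0")
  ).join(" ");
}

// Hypothetical helper: decode with a fresh decoder each time.
// ignoreBOM defaults to false, which means a leading BOM is removed.
function decodeUtf8(bytes) {
  return new TextDecoder("utf-8").decode(bytes);
}

const encoder = new TextEncoder();
console.log(toHex(encoder.encode("\uFEFF"))); // EF BB BF
console.log(toHex(encoder.encode("\u200B"))); // E2 80 8B

const withBom = decodeUtf8(encoder.encode("\uFEFF<!DOCTYPE html>"));
const withZwsp = decodeUtf8(encoder.encode("\u200B<!DOCTYPE html>"));
console.log(withBom.startsWith("<!DOCTYPE"));  // true
console.log(withZwsp.startsWith("<!DOCTYPE")); // false
```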

grmdgs · Answer

I understand that there is a bug in SharePoint 2013 where the HTML editor adds these characters into your content.

I've been dealing with this for a while, and this is the solution I'm using, which seems to work. I added this JavaScript to a file referenced by my master page.

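// Element types whose content is scanned for zero width spaces.
var elements = ["h1","h2","h3","h4","p","strong","label","span","a"];

// Run removeZWS on every matching element in the page.
function targetZWS() {
    for (var i = 0; i < elements.length; i++) {
        jQuery(elements[i]).each(function() {
            removeZWS(this);
        });
    }
}

// Rewrite the element's HTML with every U+200B removed.
function removeZWS(target) {
    jQuery(target).html(jQuery(target).html().replace(/\u200B/g, ''));
}

/* load functions: SharePoint calls each function named in this array on body load */
jQuery(document).ready(function() {
    _spBodyOnLoadFunctionNames.push("targetZWS");
});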

Links I looked into investigating this:

https://social.msdn.microsoft.com/Forums/sharepoint/en-US/23804eed-8f00-4b07-bc63-7662311a35a4/why-does-sharepoint-put-in-character-code-8203-in-a-richtext-field?forum=sharepointdevelopment
https://social.technet.microsoft.com/Forums/office/en-US/e87a82f0-1ab5-4aa7-bb7f-27403a7f46de/finding-8203-unicode-characters-in-my-source-code?forum=sharepointgeneral
http://www.sharepointpals.com/post/Removing-8203-in-RichTextHTML-field-Sharepoint

Tags: html, encoding, sublimetext2

jds
