
Rendering Plaintext as HTML maintaining whitespace – without <pre>

Given any arbitrary text file full of printable characters, how can this be converted to HTML that would be rendered exactly the same (with the following requirements)?

  • Does not rely on any but the default HTML whitespace rules
    • No <pre> tag
    • No CSS white-space rules
  • <p> tags are fine, but not required (<br />s and/or <div>s are fine)
  • Whitespace is maintained exactly.

    Given the following lines of input (ignore erroneous auto syntax highlighting):

    Line one
        Line two, indented    four spaces
    

    A browser should render the output exactly the same, maintaining the indentation of the second line and the gap between "indented" and "spaces". Of course, I am not actually looking for monospaced output, and the font is orthogonal to the algorithm/markup.

    Given the two lines as a complete input file, example correct output would be:

    Line one<br />&nbsp;&nbsp;&nbsp;&nbsp;Line two, indented&nbsp;&nbsp;&nbsp; four spaces
    
  • Soft wrapping in the browser is desirable. That is, the resulting HTML should not force the user to scroll horizontally, even when input lines are wider than the viewport (assuming individual words are still narrower than said viewport).

I’m looking for a fully defined algorithm. Bonus points for an implementation in Python or JavaScript.

(Please do not just answer that I should be using <pre> tags or a CSS white-space rule, as my requirements render those options untenable. Please also don’t post untested and/or naïve suggestions such as “replace all spaces with &nbsp;.” After all, I’m positive a solution is technically possible — it’s an interesting problem, don’t you think?)

Asked by Alan H. on Feb 15 '11 18:02


3 Answers

The way to do that while still allowing the browser to wrap long lines is to replace each sequence of two spaces with a space and a non-breaking space.

The browser will render all of the spaces (normal and non-breaking ones), while still wrapping long lines at the normal spaces.

JavaScript:

text = html_escape(text); // placeholder: any function that escapes &, < and >
text = text.replace(/\t/g, '    ')
           .replace(/  /g, '&nbsp; ')
           .replace(/  /g, ' &nbsp;') // second pass
                                      // handles odd number of spaces, where we 
                                      // end up with "&nbsp;" + " " + " "
           .replace(/\r\n|\n|\r/g, '<br />');
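
As a self-contained sketch of the same idea, the snippet below inlines the escaping step and runs the question's two-line sample through it. The function name `plainTextToHtml` is made up for illustration. It also adds one step that is not in the snippet above: a space at the very start of a line is turned into `&nbsp;`, on the assumption that browsers drop a leading normal space under the default whitespace rules.

```javascript
// Hypothetical helper: a sketch of the two-pass replacement, not a definitive implementation.
function plainTextToHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')            // escape HTML metacharacters first
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/\r\n|\r/g, '\n')         // normalise line endings
    .replace(/\t/g, '    ')            // tabs become four spaces (no tab stops)
    .replace(/^ /gm, '&nbsp;')         // added step: protect a space that starts a line
    .replace(/  /g, '&nbsp; ')         // first pass over pairs of spaces
    .replace(/  /g, ' &nbsp;')         // second pass for odd-length runs
    .replace(/\n/g, '<br />');
}

const sample = 'Line one\n    Line two, indented    four spaces';
console.log(plainTextToHtml(sample));
// Line one<br />&nbsp;&nbsp; &nbsp;Line two, indented&nbsp; &nbsp; four spaces
```

Every run of spaces still contains at least one normal space, so the browser keeps its soft-wrap opportunities.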
Answered by Arnaud Le Blanc on Sep 28 '22 02:09


Use a zero-width space (&#8203;) to preserve whitespace and allow the text to wrap. The basic idea is to pair each space or sequence of spaces with a zero-width space, then replace each space with a non-breaking space. You'll also want to escape the HTML special characters and convert line breaks to <br /> tags.

If you don't need to encode non-ASCII (Unicode) characters as numeric entities, it's trivial. You can just use string.replace():

function textToHTML(text)
{
    return ((text || "") + "")  // make sure it is a string;
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/\t/g, "    ")
        .replace(/ /g, "&#8203;&nbsp;&#8203;")
        .replace(/\r\n|\r|\n/g, "<br />");
}

If it's ok for the white space to wrap, pair each space with a zero-width space as above. Otherwise, to keep white space together, pair each sequence of spaces with a zero-width space:

    .replace(/ /g, "&nbsp;")
    .replace(/((&nbsp;)+)/g, "&#8203;$1&#8203;")
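
To make the difference between the two variants concrete, here is a small sketch that applies each one to a string with a double space. The function names `wrapEachSpace` and `wrapSpaceRuns` are made up for illustration, and HTML escaping is left out to keep the output short.

```javascript
// Variant 1: every space gets its own zero-width spaces, so a run of spaces may wrap in the middle.
function wrapEachSpace(text) {
  return text.replace(/ /g, "&#8203;&nbsp;&#8203;");
}

// Variant 2: a run of spaces stays together; wrapping is only possible before or after the run.
function wrapSpaceRuns(text) {
  return text
    .replace(/ /g, "&nbsp;")
    .replace(/((&nbsp;)+)/g, "&#8203;$1&#8203;");
}

console.log(wrapEachSpace("a  b")); // a&#8203;&nbsp;&#8203;&#8203;&nbsp;&#8203;b
console.log(wrapSpaceRuns("a  b")); // a&#8203;&nbsp;&nbsp;&#8203;b
```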

To encode Unicode characters as numeric entities, it's a little more complex. You need to iterate over the string:

var charEncodings = {
    "\t": "&nbsp;&nbsp;&nbsp;&nbsp;",
    " ": "&nbsp;",
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    "\n": "<br />",
    "\r": "<br />"
};
var space = /[\t ]/;
var noWidthSpace = "&#8203;";
function textToHTML(text)
{
    text = (text || "") + "";  // make sure it is a string;
    text = text.replace(/\r\n/g, "\n");  // avoid adding two <br /> tags
    var html = "";
    var lastChar = "";
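    // Use a numeric index: "for (var i in text)" yields string keys, so i + 1 would concatenate
    for (var i = 0; i < text.length; i++)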
    {
        var char = text[i];
        var charCode = text.charCodeAt(i);
        if (space.test(char) && !space.test(lastChar) && space.test(text[i + 1] || ""))
        {
            html += noWidthSpace;
        }
        html += char in charEncodings ? charEncodings[char] :
        charCode > 127 ? "&#" + charCode + ";" : char;
        lastChar = char;
    }
    return html;
}  
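
One caveat with `charCodeAt`: it returns UTF-16 code units, so a character above U+FFFF (many emoji, for example) comes out as two surrogate values, which are not valid character references. A minimal sketch of just the entity-encoding step using code points instead is shown below; the name `encodeNonAscii` is hypothetical, and it assumes an ES2015+ environment.

```javascript
// Hypothetical helper: encode every non-ASCII character as a numeric character reference.
function encodeNonAscii(text) {
  let out = "";
  for (const ch of String(text)) {      // for...of iterates by code point, not by UTF-16 unit
    const codePoint = ch.codePointAt(0);
    out += codePoint > 127 ? "&#" + codePoint + ";" : ch;
  }
  return out;
}

console.log(encodeNonAscii("é 😀")); // &#233; &#128512;
```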

Now, just a comment. Without using monospace fonts, you'll lose some formatting. Consider how these lines of text with a monospace font form columns:

ten       seven spaces
eleven    four spaces

Without the monospaced font, you will lose the columns:

 ten       seven spaces
 eleven    four spaces

It seems that the algorithm to fix that would be very complex.

Answered by gilly3 on Sep 28 '22 04:09


This doesn't quite meet all your requirements (for one thing, it doesn't handle tabs), but on a couple of occasions I've used the following snippet, which adds a wordWrap() method to JavaScript strings, to do something similar to what you're describing. It might be a good starting point for something that also does the additional things you want.

//+ Jonas Raoni Soares Silva
//@ http://jsfromhell.com/string/wordwrap [rev. #2]

// String.wordWrap(maxLength: Integer,
//                 [breakWith: String = "\n"],
//                 [cutType: Integer = 0]): String
//
//   Returns a string with the extra characters/words "broken".
//
//     maxLength  maximum amount of characters per line
//     breakWith  string that will be added whenever one is needed to
//                break the line
//     cutType    0 = words longer than "maxLength" will not be broken
//                1 = words will be broken when needed
//                2 = any word that trespasses the limit will be broken

String.prototype.wordWrap = function(m, b, c){
    var i, j, l, s, r;
    if(m < 1)
        return this;
    for(i = -1, l = (r = this.split("\n")).length; ++i < l; r[i] += s)
        for(s = r[i], r[i] = ""; s.length > m; r[i] += s.slice(0, j) + ((s = s.slice(j)).length ? b : ""))
            j = c == 2 || (j = s.slice(0, m + 1).match(/\S*(\s)?$/))[1] ? m : j.input.length - j[0].length
            || c == 1 && m || j.input.length + (j = s.slice(m).match(/^\S*/)).input.length;
    return r.join("\n");
};

I'd also note that, in general, you'd want to use a monospaced font if tabs are involved, because the width of words varies with the proportional font used (making the results of using tab stops very font-dependent).

Update: Here's a slightly more readable version courtesy of an online JavaScript beautifier:

String.prototype.wordWrap = function(m, b, c) {
    var i, j, l, s, r;
    if (m < 1)
        return this;
    for (i = -1, l = (r = this.split("\n")).length; ++i < l; r[i] += s)
        for (s = r[i], r[i] = ""; s.length > m; r[i] += s.slice(0, j) + ((s =
                s.slice(j)).length ? b : ""))
            j = c == 2 || (j = s.slice(0, m + 1).match(/\S*(\s)?$/))[1] ? m :
            j.input.length - j[0].length || c == 1 && m || j.input.length +
            (j = s.slice(m).match(/^\S*/)).input.length;
    return r.join("\n");
};
Answered by martineau on Sep 28 '22 03:09