Break string after specific word and put remains on new line (Regex)

Question

Suppose that I have a text field in which a user can submit code snippets. I want to detect when a specific word occurs in the string and then do something with the words/characters that come after that word.

Let's say we have a string and that after the word pyjamas I want to start the rest of the code on a new line without an indent. (Very similar to how code beautifiers work.) The output will be rendered inside pre, so I don't want any <br> tags or other HTML tags.

There are some catches though.

Everything following a word (pyjamas) has to start on a new line on the same "level" (the same number of tab indents) as the line before.
Commas should always start on a new line, outdented by one tab.
When there is another character, let's say an exclamation mark !, the code following it has to start on a new line, indented by one more tab.

Example:

Input:

Bananas! Apples and pears walk down pyjamas the street! and they say pyjamas hi to eachother, pyjamas But then! some one else comes pyjamas along pyjamas Who is he?, pyjamas I don't know who! he is pyjamas whatever,,

Output:

Bananas!
    Apples and pears walk down pyjamas
    the street!
        and they say pyjamas
        hi to eachother
    , pyjamas
    But then!
        some one else comes pyjamas
        along pyjamas
        Who is he?
    , pyjamas
    I don't know who!
        he is pyjamas
        whatever
    ,
,

I am working with jQuery, so you can use it if you want.

Here is a fiddle with the code above, so you can test it out. My result thus far is not great at all. (Type something in the textarea, the output will change.) As I'm currently only barely knowledgeable with regex, I am in need of some help.

What I have so far:

var a = $("textarea").val(),
    b = a.split('!').join("!\n  "),
    c = b.split('pyjamas').join("pyjamas \n");

$("textarea").keyup(function() {
    $("#output>pre").html(c);
});

Martin Ender · Accepted Answer

Here is a simple approach that doesn't require recursive functions and could even be done without regular expressions (but I find them convenient here).

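function indent(str)
{
    // Returns a string of n tab characters.
    var tabs = function(n) { return new Array(n+1).join('\t'); }

    // match() returns null when nothing matches (e.g. an empty string),
    // so fall back to an empty token list.
    var tokens = str.match(/!|,|pyjamas|(?:(?!pyjamas)[^!,])+/g) || [];
    var depth = 0;
    var result = '';
    for (var i = 0; i < tokens.length; ++i)
    {
        var token = tokens[i];
        switch(token)
        {
        case '!':
            // Indent one level deeper after the !
            ++depth;
            result += token + '\n' + tabs(depth);
            break;
        case ',':
            // Go back one level and put the comma on its own new line
            --depth;
            result += '\n' + tabs(depth) + token;
            break;
        case 'pyjamas':
            // Break the line but stay on the same level
            result += token + '\n' + tabs(depth);
            break;
        default:
            result += token;
            break;
        }
    }
    return result;
}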

First, we define a function that returns a string of n tabs (for convenience).

Then we split up the process into two steps. First we tokenise the string - that is we split it into !, ,, pyjamas and anything else. (There's an explanation of the regex at the end, but you could do the tokenisation some other way as well.) Then we simply walk the tokens one by one keeping the current indentation level in depth.

If it's an ! we increment the depth, print the !, a line break and the tabs.
If it's a , we decrement the depth, print a line break, the tabs and then the ,.
If it's pyjamas, we simply print that and a line break and the tabs.
If it's anything else we just print that token.
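
To make the tokenisation step concrete, here is a small sketch of what the pattern produces. The sample input and the variable names are made up for illustration; only the regex itself comes from the function above.

```javascript
// Same pattern as in indent(); the input string is an invented sample.
var tokenPattern = /!|,|pyjamas|(?:(?!pyjamas)[^!,])+/g;
var sampleInput = 'Bananas! walk pyjamas the street, ok';
var sampleTokens = sampleInput.match(tokenPattern);

// Prints the seven tokens:
// 'Bananas', '!', ' walk ', 'pyjamas', ' the street', ',', ' ok'
console.log(sampleTokens);
```

Note that the tokens keep their surrounding spaces, and that joining them back together gives the original string, so nothing is lost during tokenisation.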

That's it. You might want to add a sanity check that depth doesn't go negative (i.e. you have more , than !). Currently a depth of -1 is simply rendered without any tabs, but you'd need to write extra ! after that to get the depth back up to 1, and at a depth of -2 the tabs helper calls new Array with a negative length, which throws a RangeError. This is quite easy to deal with, but I don't know what your assumptions or requirements about that are.
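
One possible sanity check, as a rough sketch: clamp depth at zero so that surplus commas stay at the left margin. Whether clamping is the right behaviour depends on your requirements (you might prefer to report an error instead), and indentClamped is just a name picked for this example.

```javascript
// Variant of indent() that never lets depth drop below zero.
// ASSUMPTION: surplus commas should simply stay at depth 0.
function indentClamped(str) {
    var tabs = function (n) { return new Array(n + 1).join('\t'); };
    var tokens = str.match(/!|,|pyjamas|(?:(?!pyjamas)[^!,])+/g) || [];
    var depth = 0;
    var result = '';
    tokens.forEach(function (token) {
        if (token === '!') {
            ++depth;
            result += token + '\n' + tabs(depth);
        } else if (token === ',') {
            depth = Math.max(0, depth - 1); // clamp instead of going negative
            result += '\n' + tabs(depth) + token;
        } else if (token === 'pyjamas') {
            result += token + '\n' + tabs(depth);
        } else {
            result += token;
        }
    });
    return result;
}

// Two surplus commas, then a ! that indents to depth 1 straight away.
console.log(JSON.stringify(indentClamped('a,,b!c')));
// "a\n,\n,b!\n\tc"
```

With the clamp in place, a single ! after any number of surplus commas is enough to get back to one level of indentation.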

It also doesn't take care of additional spaces after line breaks yet (see the edit at the end).

Working demo.

Now for the regex:

/
  !               # Match a literal !
|                 # OR
  ,               # Match a literal ,
|                 # OR
  pyjamas         # Match pyjamas
|                 # OR
  (?:             # open a non-capturing group
    (?!pyjamas)   # make sure that the next character is not the 'p' of 'pyjamas'
    [^!,]         # match a non-!, non-, character
  )+              # end of group, repeat once or more (as often as possible)
/g

The g modifier is there to find all matches (as opposed to just the first one). ECMAScript 6 adds a y (sticky) modifier, which makes tokenisation even easier - but annoyingly this y modifier is ECMAScript's own invention, whereas every other flavour that provides this feature uses a \G anchor within the pattern.
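
For illustration, here is a rough sketch of what tokenising with the y modifier could look like, assuming an engine that supports ES2015 sticky regexes. The helper name tokenise is invented for this example; it is not part of the code above.

```javascript
// Sketch: tokenising with the sticky (y) flag instead of g.
// ASSUMPTION: the engine supports ES2015 sticky regexes.
function tokenise(str) {
    var pattern = /!|,|pyjamas|(?:(?!pyjamas)[^!,])+/y;
    var tokens = [];
    var match;
    // With y, exec() only matches exactly at pattern.lastIndex, so input
    // that no alternative accepts would stop the loop rather than be skipped.
    while (pattern.lastIndex < str.length && (match = pattern.exec(str)) !== null) {
        tokens.push(match[0]);
    }
    return tokens;
}

console.log(tokenise('a! b pyjamas c,'));
// tokens: 'a', '!', ' b ', 'pyjamas', ' c', ','
```

For this particular pattern the result is the same as with g, because every character is accepted by one of the alternatives. The sticky flag matters more for grammars where unmatched input should be treated as an error rather than silently skipped.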

If some of the more advanced concepts in the regex are not familiar to you, I refer you to this great tutorial:

negated character classes
non-capturing groups
lookaheads

EDIT:

Here is an updated version that fixes the caveat mentioned above regarding spaces after line breaks. At the end of the processing we simply remove all spaces that follow the leading tabs of a line with:

result = result.replace(/^(\t*)[ ]+/gm, '$1');

The regex matches the beginning of a line and then captures zero or more tabs, and then as many spaces as possible. The square brackets around the space are not necessary but improve readability. The modifier g is again to find all such matches and m makes ^ match at the beginning of a line (as opposed to just the beginning of the string). In the replacement string $1 refers to what we captured in the parentheses - i.e. all those tabs. So we write back the tabs but swallow the spaces.
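
As a quick check of that clean-up step on its own, here is a small sketch. The input string is invented and imitates what the indentation pass leaves behind: a tab followed by a stray space at the start of each continued line.

```javascript
// Invented sample: each continued line starts with a tab plus one stray space.
var raw = 'Bananas!\n\t Apples pyjamas\n\t the street';
var cleaned = raw.replace(/^(\t*)[ ]+/gm, '$1');

console.log(JSON.stringify(cleaned));
// "Bananas!\n\tApples pyjamas\n\tthe street"
```

Only the spaces directly after the leading tabs are removed; spaces inside a line are left alone.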

Working demo.

Casimir et Hippolyte · Answer

Not so different from m.buettner's solution, you can do it using the replace method:

var lvl = 1;
var res = str.replace(/(!)\s*|\s*(,)|(\bpyjamas)\s+/g, function (m, g1, g2, g3) {
    if (g1) return g1 + "\n" + Array(++lvl).join("\t");
    if (g2) return "\n" + Array((lvl>1)?--lvl:lvl).join("\t") + g2;
    return g3 + "\n" + Array(lvl).join("\t"); });

console.log(res);

The idea is to use three different capturing groups and to test them in the callback function. Depending on which capture group matched, the level is incremented, decremented or left unchanged (the ground is level 1). When the level is 1 and a comma is found, the level stays at 1. I added \s* and \s+ to trim spaces before commas and after ! and pyjamas. If you don't want this, you can remove them.

With your code:

$("#output>pre").html($("textarea").val());

$("textarea").keyup(function() {
    $("#output>pre").html(function() {
        var lvl = 1;
        return $("textarea").val().replace(/(!)\s*|\s*(,)|(\bpyjamas)\s+/g,
            function (m, g1, g2, g3) {
                if (g1) return g1 + "\n" + Array(++lvl).join("\t");
                if (g2) return "\n" + Array((lvl>1)?--lvl:lvl).join("\t") + g2;
                return g3 + "\n" + Array(lvl).join("\t"); });
    });
});

Note: it is probably cleaner to define a function that you can reuse later.
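
A minimal sketch of such a reusable function could look like this. The name indentByReplace is made up for the example; the regex and the callback are the same as above.

```javascript
// Hypothetical wrapper around the replace-based approach shown above.
function indentByReplace(str) {
    var lvl = 1; // level 1 is the ground level, i.e. zero tabs
    return str.replace(/(!)\s*|\s*(,)|(\bpyjamas)\s+/g, function (m, g1, g2, g3) {
        if (g1) return g1 + "\n" + Array(++lvl).join("\t");
        if (g2) return "\n" + Array(lvl > 1 ? --lvl : lvl).join("\t") + g2;
        return g3 + "\n" + Array(lvl).join("\t");
    });
}

console.log(JSON.stringify(indentByReplace('a! b pyjamas c, d')));
// "a!\n\tb pyjamas\n\tc\n, d"
```

Because lvl is declared inside the function, every call starts again at the ground level. In the keyup handler you could then pass it the textarea value, for example $("#output>pre").text(indentByReplace($("textarea").val())). Using text() instead of html() there is a suggestion rather than part of the original code: it keeps snippets containing < or & from being interpreted as markup.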

Tags: javascript, html, jquery, string, regex

Bram Vanroy
