
Detecting type of line breaks

What would be the most efficient (fast and reliable enough) way in JavaScript to determine the type of line breaks used in a text: Unix or Windows?

In my Node app I have to read in large utf-8 text files and then process them based on whether they use Unix or Windows line breaks.

When the type of line break is uncertain, I want to decide based on which one is most likely.

UPDATE

See my own answer below for the code I ended up using.

vitaly-t asked Jan 15 '16 21:01


2 Answers

Thanks, @Sam-Graham. I tried to produce an optimized version. The function's return value is also directly usable (see the example below):

function getLineBreakChar(string) {
    const indexOfLF = string.indexOf('\n', 1)  // Start at index 1: an LF at index 0 cannot be preceded by a CR
    
    if (indexOfLF === -1) {
        if (string.indexOf('\r') !== -1) return '\r'
        
        return '\n'
    }
    
    if (string[indexOfLF - 1] === '\r') return '\r\n'
    
    return '\n'
}

Note 1: This assumes the string is healthy (it contains only one type of line break).

Note 2: This assumes you want LF as the default (when no line break is found).
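
If the assumption that the string contains only one type of line break is in doubt, a check along these lines could be run first. This is a minimal sketch; hasMixedLineBreaks is a hypothetical name, and it treats CRLF, bare LF and bare CR as the three possible kinds.

```javascript
// Hypothetical check for the "healthy string" assumption: returns true when
// the text contains more than one kind of line break (CRLF, bare LF, bare CR).
function hasMixedLineBreaks(text) {
    const kinds = new Set()

    for (let i = 0; i < text.length; i++) {
        const ch = text[i]

        if (ch === '\r') {
            if (text[i + 1] === '\n') {
                kinds.add('CRLF')
                i++  // Skip the LF that belongs to this CRLF pair
            } else {
                kinds.add('CR')
            }
        } else if (ch === '\n') {
            kinds.add('LF')
        }

        // Stop as soon as a second kind shows up
        if (kinds.size > 1) return true
    }

    return false
}
```

The loop exits early on the first mismatch, so a mixed file is usually detected without scanning all of it; a healthy file still costs one full pass.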


Usage example:

fs.writeFileSync(filePath,
        string.substring(0, a) +
        getLineBreakChar(string) +
        string.substring(b)
);

This utility may be useful too:

const getLineBreakName = (lineBreakChar) =>
    lineBreakChar === '\n' ? 'LF' : lineBreakChar === '\r' ? 'CR' : 'CRLF'

Mir-Ismaili answered Sep 22 '22 00:09


Look for the first LF with source.indexOf('\n'), then check whether the character before it is a CR: source[source.indexOf('\n') - 1] === '\r'. This way you only find the first line break and classify the text by it. In summary:

function whichLineEnding(source) {
     var temp = source.indexOf('\n');
     if (source[temp - 1] === '\r')
         return 'CRLF'
     return 'LF'
}

There are two fairly popular npm libraries that do this: node-newline and crlf-helper. The first splits the entire string, which is very inefficient in your case. The second uses a regex, which in your case would not be quick enough.

However, judging from your edit, if you want to determine which type is more plentiful, I would use the code from node-newline, as it handles that case.
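
Counting which type is more plentiful can also be done without splitting the string. The following is a minimal sketch and not the node-newline code; dominantLineEnding is a hypothetical name, and falling back to LF on a tie (or when no line break is found) is an assumption.

```javascript
// Hypothetical helper: counts CRLF and bare-LF line breaks in a single pass
// and reports whichever occurs more often.
function dominantLineEnding(text) {
    let crlf = 0;
    let lf = 0;
    let index = text.indexOf('\n');

    while (index !== -1) {
        // An LF preceded by a CR is a Windows break; otherwise it is a Unix break.
        if (index > 0 && text[index - 1] === '\r') crlf++;
        else lf++;
        index = text.indexOf('\n', index + 1);
    }

    // Assumption: ties, including "no line breaks at all", default to LF.
    return { crlf: crlf, lf: lf, dominant: crlf > lf ? 'CRLF' : 'LF' };
}
```

For very large files, the same function could be applied to just the first chunk of the text if a full scan turns out to be too slow, at the cost of some reliability.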

Sam-Graham answered Sep 19 '22 00:09