
Read lines from a file with variable line endings in Go


How can I read lines from a file where the line endings are a carriage return (CR), a line feed (LF), or both (CRLF)?

The PDF specification allows lines to end with CR, LF, or CRLF.

  • bufio.Reader.ReadString() and bufio.Reader.ReadBytes() accept only a single delimiter byte.

  • bufio.Scanner.Scan(), with its default split function bufio.ScanLines, handles \n optionally preceded by \r, but not a lone \r. From the bufio.ScanLines documentation:

    The end-of-line marker is one optional carriage return followed by one mandatory newline.

Do I need to write my own function that uses bufio.Reader.ReadByte()?

Ralph asked Jan 02 '17 21:01




1 Answer

You can write a custom bufio.SplitFunc for bufio.Scanner, e.g.:

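import "bytes"

// Mostly bufio.ScanLines code:
func ScanPDFLines(data []byte, atEOF bool) (advance int, token []byte, err error) {
    if atEOF && len(data) == 0 {
        return 0, nil, nil
    }
    if i := bytes.IndexAny(data, "\r\n"); i >= 0 {
        if data[i] == '\n' {
            // We have a line terminated by a single newline.
            return i + 1, data[0:i], nil
        }
        // We have a carriage return. If it is the last byte read so far,
        // request more data to see whether a newline follows; otherwise a
        // CRLF split across two reads would yield a spurious empty line.
        if i == len(data)-1 && !atEOF {
            return 0, nil, nil
        }
        advance = i + 1
        if len(data) > i+1 && data[i+1] == '\n' {
            // CRLF: skip the newline as well.
            advance++
        }
        return advance, data[0:i], nil
    }
    // If we're at EOF, we have a final, non-terminated line. Return it.
    if atEOF {
        return len(data), data, nil
    }
    // Request more data.
    return 0, nil, nil
}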

And use it like:

scan := bufio.NewScanner(r)
scan.Split(ScanPDFLines)
kopiczko answered Sep 22 '22 10:09