
Read files line by line with \r, \n or \r\n as line separator

Tags:

ruby

I want to process files line by line. However, these files have different line separators: "\r", "\n" or "\r\n". I don't know which one they use or which kind of OS they come from.

I have two solutions:

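  1. Use bash commands to translate these separators to "\n".

    # tr maps single characters, not strings, so first strip the "\r"
    # from each "\r\n", then turn any remaining "\r" into "\n".
    sed $'s/\r$//' file |
    tr '\r' '\n' |
    ruby process.rb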
    
  2. Read the whole file and gsub the separators.

    text = File.read('xxx.txt')
    text.gsub!(/\r\n?/, "\n")
    text.each_line do |line|
      # do something with line
    end
    

But the second solution is not good when the file is huge, because it reads the whole file into memory. See reference. Is there another idiomatic and efficient Ruby solution?

ryan asked Jan 27 '15 03:01



1 Answer

I suggest you first determine the line separator. I've assumed that you can do that by reading characters until you encounter "\n" or "\r" (or reach the end of the file, in which case we can regard "\n" as the line separator). If the character "\n" is found, I assume that to be the separator; if "\r" is found I attempt to read the next character. If I can do so and it is "\n", I return "\r\n" as the separator. If "\r" is the last character in the file or is followed by a character other than "\n", I return "\r" as the separator.

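def separator(fname)
  # The block form of File.open closes the file, even on an early return.
  File.open(fname) do |f|
    # Scan characters until the first "\r" or "\n" (or the end of the file).
    f.each_char do |c|
      return "\n" if c == "\n"
      next unless c == "\r"
      # "\r" followed by "\n" is "\r\n"; otherwise "\r" alone is the separator.
      return f.getc == "\n" ? "\r\n" : "\r"
    end
  end
  "\n" # no "\r" or "\n" found (this also covers an empty file)
end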

Then process the file line by line:

def process(fname)
  sep = separator(fname)
  IO.foreach(fname, sep) { |line| puts line }
end

I haven't converted "\r" or "\r\n" to "\n", but of course you could do that easily: open a file for writing and, in process, write each line to the output file with the default line separator in place of sep.
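For example, the conversion might look like the following minimal sketch. The names convert and detect_separator are hypothetical (they are not part of the code above). detect_separator is a compact stand-in for separator so that the block runs on its own, and the sketch assumes each file uses a single separator throughout.

```ruby
# Hypothetical helper: a compact stand-in for the separator method,
# repeated here only so that this block is self-contained.
def detect_separator(fname)
  File.open(fname) do |f|
    f.each_char do |c|
      return "\n" if c == "\n"
      # A "\r" is either a separator on its own or the first half of "\r\n".
      if c == "\r"
        return f.getc == "\n" ? "\r\n" : "\r"
      end
    end
  end
  "\n" # no "\r" or "\n" found (for example, an empty file)
end

# Hypothetical name: copies src to dst one line at a time, ending every
# line with "\n". Only one line is held in memory at any moment.
def convert(src, dst)
  sep = detect_separator(src)
  File.open(dst, "w") do |out|
    File.foreach(src, sep) do |line|
      # Drop the detected separator (if present) and append "\n" instead.
      out.write(line.chomp(sep) + "\n")
    end
  end
end
```

Calling convert("temp", "temp.out") would then leave the original file untouched and write the normalized copy next to it. Note that a final line without a trailing separator still gets a "\n" appended.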

Let's try it (for clarity I show the value returned by separator):

fname = "temp"

IO.write(fname, "slash n line 1\nslash n line 2\n")
  #=> 30 
separator(fname)                                    
  #=> "\n" 
process(fname)
  # slash n line 1
  # slash n line 2

IO.write(fname, "slash r line 1\rslash r line 2\r")
  #=> 30 
separator(fname)
  #=> "\r" 
process(fname)
  # slash r line 1
  # slash r line 2

IO.write(fname, "slash r slash n line 1\r\nslash r slash n line 2\r\n")
  #=> 48 
separator(fname)
  #=> "\r\n" 
process(fname)
  # slash r slash n line 1
  # slash r slash n line 2
Cary Swoveland answered Nov 15 '22 05:11