I want to process files line by line. However, these files use different line separators: "\r", "\n" or "\r\n". I don't know which one a given file uses or which OS it came from.
I have two solutions:
Use a bash command to translate these separators to "\n":
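# tr maps single characters, so '\r\n' cannot be matched as a pair:
# strip the "\r" of each "\r\n" first (GNU sed), then turn any remaining "\r" into "\n"
sed 's/\r$//' file |
  tr '\r' '\n' |
  ruby process.rb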
Read the whole file and gsub these separators:
text = File.read('xxx.txt')
text.gsub!(/\r\n?/, "\n")
text.each_line do |line|
  # do something with line
end
But the second solution is not good when the file is huge (see reference). Is there another idiomatic and efficient Ruby solution?
I suggest you first determine the line separator. I've assumed that you can do that by reading characters until you encounter "\n" or "\r" (or reach the end of the file, in which case we can regard "\n" as the line separator). If the character "\n" is found, I assume that to be the separator; if "\r" is found I attempt to read the next character. If I can do so and it is "\n", I return "\r\n" as the separator. If "\r" is the last character in the file or is followed by a character other than "\n", I return "\r" as the separator.
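def separator(fname)
  File.open(fname) do |f|   # the block form closes the file when we return
    f.each_char do |c|
      return "\n" if c == "\n"
      next unless c == "\r"
      # "\r" found: the separator is "\r\n" only if the next character is "\n"
      # (getc returns nil at the end of the file, which yields "\r")
      return f.getc == "\n" ? "\r\n" : "\r"
    end
  end
  "\n"                      # no separator found, or the file is empty
end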
Then process the file line by line:
def process(fname)
  sep = separator(fname)
  IO.foreach(fname, sep) { |line| puts line }
end
I haven't converted "\r" or "\r\n" to "\n", but of course you could do that easily. Just open a file for writing and, in process, read each line and write it to the output file with the default line separator.
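As a rough sketch of that conversion: the helper name normalize_newlines is hypothetical (it is not part of the answer above), and separator is restated in compact form only so the snippet runs on its own. Each line is streamed to the output file, so the input is never held in memory all at once.

```ruby
# Compact restatement of the separator idea: return the first line
# separator found in the file, defaulting to "\n".
def separator(fname)
  File.open(fname) do |f|
    f.each_char do |c|
      return "\n" if c == "\n"
      if c == "\r"
        # getc returns nil at end of file, which falls through to "\r"
        return f.getc == "\n" ? "\r\n" : "\r"
      end
    end
  end
  "\n"
end

# Hypothetical helper: copy src to dst one line at a time,
# rewriting each line ending as "\n".
def normalize_newlines(src, dst)
  sep = separator(src)
  File.open(dst, "w") do |out|
    File.foreach(src, sep) do |line|
      out << line.chomp(sep) << "\n"
    end
  end
end
```

For example, normalize_newlines("temp", "temp.unix") would write a "\n"-terminated copy of temp. Note that with this sketch a final line that lacks a separator gains a trailing "\n", and the file is assumed to use a single separator style throughout.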
Let's try it (for clarity I show the value returned by separator):
fname = "temp"
IO.write(fname, "slash n line 1\nslash n line 2\n")
#=> 30
separator(fname)
#=> "\n"
process(fname)
# slash n line 1
# slash n line 2
IO.write(fname, "slash r line 1\rslash r line 2\r")
#=> 30
separator(fname)
#=> "\r"
process(fname)
# slash r line 1
# slash r line 2
IO.write(fname, "slash r slash n line 1\r\nslash r slash n line 2\r\n")
#=> 48
separator(fname)
#=> "\r\n"
process(fname)
# slash r slash n line 1
# slash r slash n line 2