Ruby string encoding problem

Tags: ruby

I've looked at the other ruby/encoding related posts but haven't been able to figure out why the following is not working. Likely just because I'm dense, but here's the situation.

I'm using Ruby 1.9 on Windows. I have a set of CSV files that need some data appended to the end of each line. Whenever I run my script, the appended characters come out as gibberish. The input text appears to be encoded as IBM437, whereas the string I'm appending starts out as US-ASCII. Nothing I've tried to force the encoding of either the input strings or the appended string changes the resulting output. I'm stumped. The encoding calls in the code below are simply the last variation I tried.

def append_salesperson(txt, salesperson)
  if txt.length > 2
    return txt.chomp.force_encoding('US-ASCII') + %(, "", "", "#{salesperson}")
  end
end

salespeople = Hash[
    "fname", "Record Manager"]

outfile = File.open("ActData.csv", "w:US-ASCII")

salespeople.each do | filename, recordManager |
  infile = File.open("#{filename}.txt")
  infile.each do |line|
    outfile.puts append_salesperson(line, recordManager)
  end
  infile.close
end
outfile.close
John Prideaux asked Feb 19 '10 15:02



1 Answer

One small note related to your question: you build the appended CSV data as %(, "", "", "#{salesperson}"), which puts a space before each opening double quote. If the salesperson text contains a comma, that space can cause #{salesperson} to be interpreted as multiple fields. To fix this, leave no whitespace between the comma and the double quote. Example: "this is a field","Last, First","and so on". This is a little gotcha I ran into when creating reports meant to be viewed in Excel.

For reference, RFC 4180, "Common Format and MIME Type for Comma-Separated Values (CSV) Files", describes the grammar of a CSV file.
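As a rough illustration of the quoting point, here is a minimal sketch that lets Ruby's standard csv library build the appended fields instead of hand-writing the quotes. The helper name append_salesperson_fields and the sample row are hypothetical, not taken from the question, and the sketch assumes a Ruby where the keyword-argument form of CSV.generate_line is available. It only addresses the comma and quote layout, not the encoding issue.

```ruby
require "csv"

# Hypothetical helper: appends two empty fields and the salesperson name
# to an existing CSV line, with no whitespace between commas and quotes.
def append_salesperson_fields(line, salesperson)
  # force_quotes wraps every field in double quotes, including empty ones.
  extra = CSV.generate_line(["", "", salesperson], force_quotes: true).chomp
  "#{line.chomp},#{extra}"
end

# Sample input line (an assumption for illustration only).
row = append_salesperson_fields("1001,Acme\n", "Prideaux, John")
puts row

# Parsing the row back shows the name survives as a single field.
fields = CSV.parse_line(row)
puts fields.length
```

Because the library does the quoting, a comma inside the name stays within one field when the row is parsed again.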

lillq answered Oct 13 '22 17:10