In Ruby 1.9.3, I can get the codepoints of a string:
> "foo\u00f6".codepoints.to_a
=> [102, 111, 111, 246]
Is there a built-in method to go the other direction, i.e. from an integer array to a string?
I'm aware of:
# not acceptable; only works with UTF-8
[102, 111, 111, 246].pack("U*")
# works, but not very elegant
[102, 111, 111, 246].inject('') {|s, cp| s << cp }
# concise, but I need to unshift that pesky empty string to "prime" the inject call
['', 102, 111, 111, 246].inject(:<<)
UPDATE (response to Niklas' answer)
Interesting discussion. pack("U*") always returns a UTF-8 string, while the inject version returns a string in the file's source encoding.
#!/usr/bin/env ruby
# encoding: iso-8859-1
p [102, 111, 111, 246].inject('', :<<).encoding
p [102, 111, 111, 246].pack("U*").encoding
# this raises an Encoding::CompatibilityError
[102, 111, 111, 246].pack("U*") =~ /\xf6/
For me, the inject call returns an ISO-8859-1 string, while pack returns a UTF-8 one. To prevent the error, I could use pack("U*").encode(__ENCODING__), but that makes me do extra work.
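As a sketch of that workaround (ISO-8859-1 is chosen here purely for illustration; the pattern-building helper is my own, not from the discussion above), converting pack's UTF-8 result into the target encoding before matching avoids the Encoding::CompatibilityError:

```ruby
# Build the string in UTF-8, then re-encode it to match a Latin-1 pattern.
utf8   = [102, 111, 111, 246].pack("U*")   # "fooö" in UTF-8
latin1 = utf8.encode("ISO-8859-1")         # same characters, one byte each

# A regexp whose source is explicitly ISO-8859-1 (byte 0xF6 = "ö" there).
pattern = Regexp.new("\xf6".b.force_encoding("ISO-8859-1"))

latin1 =~ pattern   # => 3 (matches the "ö" after "foo")
```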
UPDATE 2
Apparently String#<< doesn't always append a codepoint correctly; it depends on the string's encoding. So it looks like pack is still the best option.
[225].inject(''.encode('utf-16be'), :<<) # fails miserably
[225].pack("U*").encode('utf-16be') # works
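For what it's worth, the pack-then-encode route above can be sanity-checked; the byte values in the comment follow from UTF-16BE's big-endian two-byte layout:

```ruby
# "á" (U+00E1) built via pack, then transcoded to UTF-16BE.
s = [225].pack("U*").encode("utf-16be")

s.encoding    # => #<Encoding:UTF-16BE>
s.codepoints  # => [225]
s.bytes       # => [0, 225]  (big-endian 0x00 0xE1)
```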
The most obvious adaptation of your own attempt would be
[102, 111, 111, 246].inject('', :<<)
This is, however, not a good solution, as it only works if the initial empty string literal has an encoding capable of holding the entire Unicode character range. The following fails:
#!/usr/bin/env ruby
# encoding: iso-8859-1
p "\u{1234}".codepoints.to_a.inject('', :<<)
So I'd actually recommend
codepoints.pack("U*")
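A quick round-trip check of this recommendation (the sample codepoints, including the non-Latin U+1234, are my own; note that on Ruby 1.9 codepoints returns an Enumerator, hence the to_a):

```ruby
cps = [102, 111, 111, 246, 0x1234]
s = cps.pack("U*")

s.encoding              # => #<Encoding:UTF-8>
s.codepoints.to_a == cps  # => true, pack("U*") inverts codepoints
```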
I don't know what you mean by "only works with UTF-8". It creates a Ruby string with UTF-8 encoding, but UTF-8 can hold the whole Unicode character range, so what's the problem? Observe:
irb(main):010:0> s = [0x33333, 0x1ffff].pack("U*")
=> "\u{33333}\u{1FFFF}"
irb(main):011:0> s.encoding
=> #<Encoding:UTF-8>
irb(main):012:0> [0x33333, 0x1ffff].pack("U*") == [0x33333, 0x1ffff].inject('', :<<)
=> true
Depending on the values in your array and the value of Encoding.default_internal, you might try:
[102, 111, 111, 246].map(&:chr).inject(:+)
You have to be careful of the encoding. Note the following:
irb(main):001:0> 0.chr.encoding
=> #<Encoding:US-ASCII>
irb(main):002:0> 127.chr.encoding
=> #<Encoding:US-ASCII>
irb(main):003:0> 128.chr.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):004:0> 255.chr.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):005:0> 256.chr.encoding
RangeError: 256 out of char range
from (irb):5:in `chr'
from (irb):5
from C:/Ruby200/bin/irb:12:in `<main>'
irb(main):006:0>
By default, 256.chr fails because chr returns either US-ASCII or ASCII-8BIT, depending on whether the codepoint is in 0..127 or 128..255.
That covers 8-bit values. If you have values larger than 255 (presumably Unicode codepoints), then you can do the following:
irb(main):006:0> Encoding.default_internal = "utf-8"
=> "utf-8"
irb(main):007:0> 256.chr.encoding
=> #<Encoding:UTF-8>
irb(main):008:0> 256.chr.codepoints
=> [256]
irb(main):009:0>
With Encoding.default_internal set to "utf-8", Unicode values > 255 should work fine (but see below):
irb(main):009:0> 65535.chr.encoding
=> #<Encoding:UTF-8>
irb(main):010:0> 65535.chr.codepoints
=> [65535]
irb(main):011:0> 65536.chr.codepoints
=> [65536]
irb(main):012:0> 65535.chr.bytes
=> [239, 191, 191]
irb(main):013:0> 65536.chr.bytes
=> [240, 144, 128, 128]
irb(main):014:0>
Now it gets interesting -- ASCII-8BIT and UTF-8 don't seem to mix:
irb(main):014:0> (0..127).to_a.map(&:chr).inject(:+).encoding
=> #<Encoding:US-ASCII>
irb(main):015:0> (0..128).to_a.map(&:chr).inject(:+).encoding
=> #<Encoding:ASCII-8BIT>
irb(main):016:0> (0..255).to_a.map(&:chr).inject(:+).encoding
=> #<Encoding:ASCII-8BIT>
irb(main):017:0> ((0..127).to_a + (256..1000000).to_a).map(&:chr).inject(:+).encoding
RangeError: invalid codepoint 0xD800 in UTF-8
from (irb):17:in `chr'
from (irb):17:in `map'
from (irb):17
from C:/Ruby200/bin/irb:12:in `<main>'
irb(main):018:0> ((0..127).to_a + (256..0xD7FF).to_a).map(&:chr).inject(:+).encoding
=> #<Encoding:UTF-8>
irb(main):019:0> (0..256).to_a.map(&:chr).inject(:+).encoding
Encoding::CompatibilityError: incompatible character encodings: ASCII-8BIT and UTF-8
from (irb):19:in `+'
from (irb):19:in `each'
from (irb):19:in `inject'
from (irb):19
from C:/Ruby200/bin/irb:12:in `<main>'
irb(main):020:0>
ASCII-8BIT and UTF-8 can be concatenated, as long as the ASCII-8BIT codepoints are all in 0..127:
irb(main):020:0> 256.chr.encoding
=> #<Encoding:UTF-8>
irb(main):021:0> (0.chr.force_encoding("ASCII-8BIT") + 256.chr).encoding
=> #<Encoding:UTF-8>
irb(main):022:0> 255.chr.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):023:0> (255.chr + 256.chr).encoding
Encoding::CompatibilityError: incompatible character encodings: ASCII-8BIT and UTF-8
from (irb):23
from C:/Ruby200/bin/irb:12:in `<main>'
irb(main):024:0>
This brings us to an ultimate solution to your question:
irb(main):024:0> (0..0xD7FF).to_a.map {|c| c.chr("utf-8")}.inject(:+).encoding
=> #<Encoding:UTF-8>
irb(main):025:0>
So I think the most general answer, assuming you want UTF-8, is:
[102, 111, 111, 246].map {|c| c.chr("utf-8")}.inject(:+)
If you know your values are in 0..255, then this is easier:
[102, 111, 111, 246].map(&:chr).inject(:+)
giving you:
irb(main):027:0> [102, 111, 111, 246].map {|c| c.chr("utf-8")}.inject(:+)
=> "fooö"
irb(main):028:0> [102, 111, 111, 246].map(&:chr).inject(:+)
=> "foo\xF6"
irb(main):029:0> [102, 111, 111, 246].map {|c| c.chr("utf-8")}.inject(:+).encoding
=> #<Encoding:UTF-8>
irb(main):030:0> [102, 111, 111, 246].map(&:chr).inject(:+).encoding
=> #<Encoding:ASCII-8BIT>
irb(main):031:0>
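As a final sanity check (my own, assuming a UTF-8-capable environment), the chr-based approach agrees with pack("U*") from the other answer:

```ruby
cps = [102, 111, 111, 246]

# Build the same string two ways: per-character chr vs. pack.
via_chr  = cps.map { |c| c.chr("utf-8") }.inject(:+)
via_pack = cps.pack("U*")

via_chr == via_pack      # => true
via_chr == "foo\u00f6"   # => true
```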
I hope this helps (albeit a bit late, perhaps) -- I found this looking for an answer to the same question, so I researched it myself.