From a gem, I get a string s with Latin-1-encoded content that I want to store in a Rails model.
r = MyRecord.new(mystring: s)
# ...
r.save
Because my PostgreSQL database uses UTF-8 encoding, saving the model after assigning the string to its string field raises an error when the string contains certain non-ASCII characters:
ActiveRecord::StatementInvalid: PG::CharacterNotInRepertoire: ERROR: invalid byte sequence for encoding "UTF8": 0xdf 0x65
...
I can solve this easily by transcoding the string:
r = MyRecord.new(mystring: s.encode(Encoding::UTF_8, Encoding::ISO_8859_1))
# ...
r.save
(Because s.encoding returns #<Encoding:ASCII-8BIT> instead of #<Encoding:ISO-8859-1>, I'm passing the source encoding as the second argument. The gem that produced s probably isn't aware that the file it read the string from is Latin-1 encoded.)
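To make the role of the second argument concrete, here is a minimal sketch that needs no Rails. The sample string is made up for illustration; its bytes 0xdf 0x65 are chosen to match the ones in the error message above. It assumes the gem returns the raw bytes tagged as ASCII-8BIT, as described.

```ruby
# Simulate what the gem hands back: Latin-1 bytes tagged as binary (ASCII-8BIT).
# "\xDF" is the Latin-1 byte for the sharp s; the word itself is a made-up sample.
s = "Stra\xDFe".b

s.encoding # ASCII-8BIT, so Ruby cannot know these bytes are Latin-1

# Passing the source encoding explicitly tells #encode how to read the bytes.
utf8 = s.encode(Encoding::UTF_8, Encoding::ISO_8859_1)

utf8.encoding        # UTF-8
utf8.valid_encoding? # true

# Without the source encoding, Ruby has no conversion for byte 0xDF from binary.
begin
  s.encode(Encoding::UTF_8)
  conversion_failed = false
rescue Encoding::UndefinedConversionError
  conversion_failed = true
end
```

The string is only reinterpreted and transcoded here; s itself is left untouched because #encode returns a new string.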
It occurred to me that knowledge about the database's string encoding does not belong in the part of the code where I do the persisting, and thus also the transcoding.
I can ask the model's class for the database's encoding:
MyRecord.connection.encoding
This doesn't return a Ruby Encoding object though; it returns a string containing the encoding's name. Fortunately, the Encoding class can be queried with names (and some aliases) to look up encodings:
Encoding.find 'UTF-8' # returns #<Encoding:UTF-8>, the value of Encoding::UTF_8
Unfortunately, different naming conventions are used: MyRecord.connection.encoding returns 'UTF8' (no hyphen), while Encoding.find(...) needs to be passed 'UTF-8' (with hyphen) or 'CP65001' if we want it to return #<Encoding:UTF-8>.
Sooooo close.
Is there a clean way to avoid hard-coding the destination encoding and instead dynamically determine and use the database's encoding?
I don't feel that doing string manipulation or pattern matching on the result of MyRecord.connection.encoding or on the contents of Encoding.aliases() would be any better than just leaving the hard-coded values in the code.
Modifying the return value of Encoding.aliases() doesn't have any effect:
Encoding.aliases['UTF8'] = 'UTF-8'
Encoding.find 'UTF8' # ArgumentError: unknown encoding name - UTF8
(and doesn't feel right either, anyway), nor does modifying the return value of #names:
Encoding::UTF_8.names.push('UTF8')
Encoding.find 'UTF8' # ArgumentError: unknown encoding name - UTF8
I guess both only return dynamically generated collections or copies of the underlying collections, and for a good reason.
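That guess can be checked with object identity. The following is a small sketch, assuming a standard MRI Ruby; the variable names are only for illustration.

```ruby
# Each call appears to build a fresh collection, so mutating one copy
# cannot influence what Encoding.find sees.
first_aliases  = Encoding.aliases
second_aliases = Encoding.aliases
same_hash = first_aliases.equal?(second_aliases) # identity check, not equality

first_names  = Encoding::UTF_8.names
second_names = Encoding::UTF_8.names
same_array = first_names.equal?(second_names)

# A change made to one returned Hash does not show up in the next one.
first_aliases['UTF8'] = 'UTF-8'
leaked = Encoding.aliases.key?('UTF8')
```

If same_hash, same_array and leaked are all false, the collections are indeed throwaway copies, which matches the behaviour seen with Encoding.find above.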
The simplest and, arguably, cleanest solution to this problem would be to not call Encoding.find directly, but to have a utility method (perhaps in a module located at lib/yourapp) which knows about the encoding name differences you care about and falls back to Encoding.find for all other inputs:
module YourApp
  module DatabaseStringEncoding
    # Maps a database encoding name to a Ruby Encoding object.
    def self.find(name)
      case name
      when 'UTF8'
        Encoding::UTF_8
      # ...
      else
        Encoding.find(name)
      end
    end
  end
end
This is both easy to understand and easy to discover (as opposed to modifying Encoding directly, which is not visible to the reader of the code that does the encoding). Based on such a find method, you could then go further and implement a method which automatically recodes a string to the database's string encoding using YourRecord.connection.encoding.
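As a rough sketch of that further step: the recode method, its parameter names and the DATABASE_NAMES table below are hypothetical, and only the 'UTF8' entry comes from the discussion above. The database encoding name is passed in as a plain string so the example runs without Rails.

```ruby
module YourApp
  module DatabaseStringEncoding
    # Hypothetical lookup table; extend it with other names your database reports.
    DATABASE_NAMES = { 'UTF8' => Encoding::UTF_8 }.freeze

    # Resolve a database encoding name, falling back to Ruby's own lookup.
    def self.find(name)
      DATABASE_NAMES.fetch(name) { Encoding.find(name) }
    end

    # Hypothetical helper: transcode a string into the database's encoding.
    # The source encoding is explicit because the string may be tagged ASCII-8BIT.
    def self.recode(string, database_encoding_name, source_encoding)
      string.encode(find(database_encoding_name), source_encoding)
    end
  end
end

# Made-up sample: Latin-1 bytes tagged as binary, as in the question.
latin1 = "Stra\xDFe".b
utf8 = YourApp::DatabaseStringEncoding.recode(latin1, 'UTF8', Encoding::ISO_8859_1)
```

In the Rails application, the second argument would presumably be YourRecord.connection.encoding rather than a literal, so the calling code no longer names the destination encoding itself.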
I know it would be more exciting to get Encoding.find to do exactly what you want, but I would argue that this "dumber" approach would actually be the better one. :-)