In my Rails app I'm working with RSS feeds from all around the world, and some feeds contain links that are not in UTF-8. The original feed links are out of my control, and in order to use them in other parts of the app, they need to be in UTF-8.
How can I detect encoding and convert to UTF-8?
Ruby 1.9
"Forcing" an encoding is easy; however, it won't convert the characters, it only changes the encoding the string is labelled with:
str = str.force_encoding('UTF-8')
str.encoding.name # => 'UTF-8'
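To make the difference concrete, here is a minimal sketch of what relabelling does to a string whose bytes are really ISO-8859-1. The sample string and variable names are made up for illustration:

```ruby
# "é" in ISO-8859-1 is the single byte 0xE9.
latin1 = "caf\xE9".dup.force_encoding('ISO-8859-1')

# Relabel the same bytes as UTF-8. No conversion happens here.
relabelled = latin1.dup.force_encoding('UTF-8')

same_bytes = relabelled.bytes == latin1.bytes  # the bytes are untouched
valid_utf8 = relabelled.valid_encoding?        # a lone 0xE9 is not valid UTF-8

puts relabelled.encoding.name
puts same_bytes
puts valid_utf8
```

So force_encoding is only the right tool when the bytes are already UTF-8 and the string merely carries the wrong label.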
If you want to perform a conversion, use encode:
begin
  str.encode("UTF-8")
rescue Encoding::UndefinedConversionError
  # ...
end
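As a rough illustration, assuming the source bytes are ISO-8859-1, encode rewrites the bytes rather than just the label. The helper name to_utf8_or_nil is hypothetical, and it also rescues Encoding::InvalidByteSequenceError, which encode can raise for malformed input:

```ruby
# Hypothetical helper: returns the converted string, or nil if conversion fails.
def to_utf8_or_nil(str)
  str.encode('UTF-8')
rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
  nil
end

# Bytes correctly labelled as ISO-8859-1 are converted: 0xE9 becomes 0xC3 0xA9.
latin1 = "caf\xE9".dup.force_encoding('ISO-8859-1')
utf8 = to_utf8_or_nil(latin1)

# A binary (ASCII-8BIT) string with a high byte has no defined conversion.
failed = to_utf8_or_nil("caf\xE9".b)

p utf8.bytes
p failed
```

The conversion only works if the string's current encoding label matches its actual bytes, so the label has to be right before calling encode.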
I would definitely read the following post for more information:
http://graysoftinc.com/character-encodings/ruby-19s-string
This gives you a UTF-8 string and won't raise an error, because any invalid or undefined character is replaced with an empty string:
str.encode('UTF-8', invalid: :replace, undef: :replace, replace: '')
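For the detection half of the question, Ruby's core library has no encoding detector, so one common heuristic is to check whether the raw bytes are already valid UTF-8 and otherwise assume a fallback encoding. The sketch below does that. The helper name to_utf8 and the ISO-8859-1 fallback are assumptions, not something the feeds guarantee:

```ruby
# Hypothetical helper: the name and the default fallback encoding are assumptions.
def to_utf8(raw, fallback: 'ISO-8859-1')
  # If the bytes already form valid UTF-8, just relabel them.
  candidate = raw.dup.force_encoding('UTF-8')
  return candidate if candidate.valid_encoding?

  # Otherwise treat the bytes as the fallback encoding and convert,
  # dropping anything that cannot be represented.
  raw.dup.force_encoding(fallback)
     .encode('UTF-8', invalid: :replace, undef: :replace, replace: '')
end

from_utf8_bytes = to_utf8("caf\xC3\xA9".b)  # bytes that are already UTF-8
from_latin1     = to_utf8("caf\xE9".b)      # bytes that are ISO-8859-1

puts from_utf8_bytes
puts from_latin1
```

This is a guess rather than true detection: a feed in some other legacy encoding would be converted without error but could produce the wrong characters, so pass the real encoding as fallback whenever the feed declares one.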