
Force strings to UTF-8 from any encoding

In my Rails app I'm working with RSS feeds from all around the world, and some feeds contain links that are not encoded in UTF-8. The original feeds are out of my control, and the links need to be in UTF-8 before I can use them in other parts of the app.

How can I detect encoding and convert to UTF-8?

Hayk Saakian asked Oct 18 '12 05:10


2 Answers

Ruby 1.9

"Forcing" an encoding is easy, but it only changes the encoding the string is tagged with; it does not convert the underlying bytes:

str = str.force_encoding('UTF-8')
str.encoding.name # => 'UTF-8'
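To make the difference concrete, here is a small sketch. The sample string and the assumption that it is really ISO-8859-1 are mine, not from the question. It shows that force_encoding leaves the bytes alone, so a relabelled string can still be invalid UTF-8, while encode actually rewrites the bytes.

```ruby
# Assumed sample: "café" as ISO-8859-1 bytes, where "é" is the single byte 0xE9.
# String#b returns a copy tagged as binary (ASCII-8BIT).
latin1 = "caf\xE9".b

# Relabelling only changes the tag; the bytes stay exactly the same.
relabelled = latin1.dup.force_encoding('UTF-8')
relabelled.bytes == latin1.bytes   # => true
relabelled.valid_encoding?         # => false, a lone 0xE9 is not valid UTF-8

# Tagging the bytes with their real encoding and then encoding converts them.
converted = latin1.dup.force_encoding('ISO-8859-1').encode('UTF-8')
converted.valid_encoding?          # => true
converted.bytes.length             # => 5, "é" is now the two bytes 0xC3 0xA9
```

So force_encoding is only the right tool when the bytes are already UTF-8 and just carry the wrong label.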

If you want to perform a conversion, use encode:

begin
  # encode returns a new string, so keep the result
  str = str.encode("UTF-8")
rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
  # ...
end
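Note that encode converts from whatever encoding the string is currently tagged with, so it can only do the right thing if that tag is correct. If you know (or are willing to guess) the source encoding, you can pass it as the second argument. A rough sketch, where to_utf8_or_nil is a hypothetical helper name and ISO-8859-1 is just an assumed source encoding:

```ruby
# Hypothetical helper: convert str to UTF-8, reading its bytes as `from`.
# Returns nil instead of raising when the conversion is not possible.
def to_utf8_or_nil(str, from)
  str.encode('UTF-8', from)
rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
  nil
end

raw = "caf\xE9".b                          # assumed ISO-8859-1 bytes, tagged binary

ok  = to_utf8_or_nil(raw, 'ISO-8859-1')    # bytes read as Latin-1, then converted
bad = to_utf8_or_nil(raw, 'BINARY')        # nil: byte 0xE9 has no mapping from binary
```

Returning nil is only one option; depending on the app you might prefer to log the link and skip it.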

I would definitely read the following post for more information:
http://graysoftinc.com/character-encodings/ruby-19s-string

kwarrick answered Oct 24 '22 09:10


This converts the string to UTF-8 and won't error out, because any invalid or undefined character is replaced with an empty string. No matter what the input is, you end up with a valid UTF-8 string. Keep in mind that the replaced characters are dropped, not recovered.

str.encode('UTF-8', invalid: :replace, undef: :replace, replace: '')
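Neither approach really detects the encoding, which Ruby's standard library can't do reliably. A common workaround is to check whether the bytes are already valid UTF-8 and otherwise fall back to a guessed encoding. Below is a minimal sketch of that idea; force_utf8 is a hypothetical helper name, and Windows-1252 as the fallback is purely my assumption about what the non-UTF-8 feeds contain.

```ruby
# Hypothetical helper. The fallback encoding is an assumption, not a detection.
def force_utf8(str, fallback = 'Windows-1252')
  # If the bytes already form valid UTF-8, just fix the label.
  utf8 = str.dup.force_encoding('UTF-8')
  return utf8 if utf8.valid_encoding?

  # Otherwise read the bytes as the fallback encoding and convert,
  # dropping anything that cannot be represented.
  str.encode('UTF-8', fallback, invalid: :replace, undef: :replace, replace: '')
end

from_latin = force_utf8("caf\xE9".b)     # not valid UTF-8, converted via fallback
from_utf8  = force_utf8("caf\u00E9".b)   # already UTF-8 bytes, only relabelled
```

The valid_encoding? check only proves the bytes could be UTF-8, not that they were meant to be, so treat this as a heuristic. If the feed's HTTP headers or XML declaration state an encoding, passing that as the fallback is probably a better guess than a fixed default.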
John Pollard answered Oct 24 '22 08:10