I get source text from the web, and sometimes the material is not a 100% valid UTF-8 byte sequence. I use iconv to silently drop the invalid sequences and get a cleaned string.
@iconv = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = @iconv.iconv(untrusted_string)
However, now that iconv has been deprecated, I see its deprecation warning a lot:
iconv will be deprecated in the future, use String#encode
I tried converting it to use String#encode's :invalid and :replace options, but it does not seem to work (i.e. the incorrect byte sequence has not been removed). What is the correct way to use String#encode for this?
This has been answered in this question:
Is there a way in ruby 1.9 to remove invalid byte sequences from strings?
Use either
untrusted_string.chars.select{|i| i.valid_encoding?}.join
or
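untrusted_string.encode('UTF-16', :invalid => :replace, :replace => '').encode('UTF-8')
The intermediate UTF-16 step matters on Ruby 1.9: converting from UTF-8 to UTF-8 is a no-op there, so the invalid bytes would be left in place.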
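For anyone who wants this wrapped up in one place, here is a minimal sketch. The helper name strip_invalid_utf8 is made up for illustration, and it assumes the incoming bytes are meant to be UTF-8. On Ruby 2.1 and later it uses String#scrub, which was added for exactly this job; on older versions it falls back to a transcoding round trip through UTF-16.

```ruby
# Hypothetical helper (not part of the original answer): returns a copy of
# the input with every invalid UTF-8 byte sequence removed.
def strip_invalid_utf8(untrusted_string)
  # Assumption: the bytes are supposed to be UTF-8, so tag a copy as such.
  text = untrusted_string.dup.force_encoding('UTF-8')

  if text.respond_to?(:scrub)
    # Ruby 2.1+: replace each invalid sequence with an empty string.
    text.scrub('')
  else
    # Older Rubies: go through a different encoding so a real conversion runs.
    text.encode('UTF-16', :invalid => :replace, :replace => '').encode('UTF-8')
  end
end

dirty = "abc\xFFdef"              # \xFF can never appear in valid UTF-8
clean = strip_invalid_utf8(dirty)

puts dirty.valid_encoding?        # false
puts clean.valid_encoding?        # true
puts clean                        # abcdef
```

Passing a replacement such as '?' instead of '' to scrub (or to :replace) keeps a visible marker where bytes were dropped, which can help when debugging bad input.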