This Iconv idiom re-encodes a string from UTF-8 to UTF-8 and, because of the //IGNORE flag, drops any byte sequences that are not valid UTF-8:
require "iconv"

def normalize(text)
  Iconv.new('UTF-8//IGNORE', 'UTF-8').iconv(text.dup)
end
How would you actually write a test for this?
Edit: I simplified the question after realizing that trying to test this in a Rails spec file with a # encoding: utf-8 magic comment was complicating the issue. The bounty is now somewhat moot, but I'll still award it if someone can show a test I can work from.
You can build a String from an array of byte values with Array#pack. That makes it easy to generate an invalid string and use it in a test.
Example:
describe "#normalize" do
  it "should remove/ignore invalid characters" do
    # this "string" equals "Mandados de busca do caso Megaupload considerados inv\xE1lidos - Tecnologia - Sol"
    # 'C*' packs every value as an unsigned byte, so 225 becomes the lone byte 0xE1
    bad_string = [77, 97, 110, 100, 97, 100, 111, 115, 32, 100, 101, 32, 98, 117, 115, 99, 97, 32, 100, 111, 32, 99, 97, 115, 111, 32, 77, 101, 103, 97, 117, 112, 108, 111, 97, 100, 32, 99, 111, 110, 115, 105, 100, 101, 114, 97, 100, 111, 115, 32, 105, 110, 118, 225, 108, 105, 100, 111, 115, 32, 45, 32, 84, 101, 99, 110, 111, 108, 111, 103, 105, 97, 32, 45, 32, 83, 111, 108].pack('C*').force_encoding('UTF-8')
    normalize(bad_string).should == 'Mandados de busca do caso Megaupload considerados invlidos - Tecnologia - Sol'
  end
end
(Sorry for the rather long test string; I couldn't find a shorter example in my code.)
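If the long array is a nuisance, a much shorter fixture should exercise the same case. The sketch below builds just "inv", the lone byte 0xE1, and "lidos" with pack, then confirms the string really is invalid UTF-8. Note that normalize_with_scrub is a hypothetical stand-in I made up so the snippet runs without the iconv gem: it uses String#scrub (Ruby 2.1 or later) rather than the Iconv call from the question, so treat it as an illustration and not as a drop-in replacement.

```ruby
# Bytes for "inv", a lone 0xE1 (invalid on its own in UTF-8), then "lidos".
# 'C*' packs each integer as an unsigned 8-bit byte.
bad_string = [105, 110, 118, 225, 108, 105, 100, 111, 115].pack('C*').force_encoding('UTF-8')

# Hypothetical stand-in for the question's normalize method.
# Assumption: String#scrub('') is used here instead of Iconv's //IGNORE.
def normalize_with_scrub(text)
  text.dup.scrub('')
end

puts bad_string.valid_encoding?          # false: the fixture is invalid UTF-8
cleaned = normalize_with_scrub(bad_string)
puts cleaned.valid_encoding?             # true
puts cleaned                             # invlidos
```

In a spec you would assert on valid_encoding? before and after the call, which documents why the fixture is "bad" and not only what the cleaned result looks like.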