
Delete non-UTF characters from a string in Ruby?

Tags: string, regex, ruby

How do I delete non-UTF-8 characters from a Ruby string? I have a string that contains, for example, the byte "\xC2". I want to remove that byte so that the string becomes valid UTF-8.

This:

text.gsub!(/\xC2/, '') 

raises an error:

incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string) 

I also looked at text.unpack('U*') and Array#pack, but did not get anywhere.

Wojtek B. asked Aug 27 '12 18:08


2 Answers

You can use encode for that:

text.encode('UTF-8', :invalid => :replace, :undef => :replace)

For more information, see String#encode in the Ruby docs.
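One caveat worth spelling out: per the String#encode docs, the default replacement for Unicode encodings is U+FFFD, so :invalid => :replace on its own substitutes a marker character rather than deleting the bad byte (the :replace option sets a different replacement string). If you are on Ruby 2.1 or later, String#scrub is probably the shortest route. A minimal sketch, assuming Ruby 2.1+ and reusing the sample string from the other answer:

```ruby
# A lone 0xC2 byte is a UTF-8 lead byte with no continuation byte,
# so this string is not valid UTF-8.
text = "testing\xC2 a non UTF-8 string"

# String#scrub replaces each invalid byte sequence with the given string;
# passing an empty string deletes the invalid bytes.
cleaned = text.scrub('')

# With no argument, scrub substitutes U+FFFD (the Unicode replacement
# character) instead of deleting.
marked = text.scrub

puts text.valid_encoding?     # prints: false
puts cleaned.valid_encoding?  # prints: true
puts cleaned                  # prints: testing a non UTF-8 string
```

scrub returns a new string; scrub! modifies the receiver in place, which is closer to the gsub! in the question.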

Iuri G. answered Oct 14 '22 22:10


You could do it like this:

# encoding: utf-8

class String
  def validate_encoding
    chars.select(&:valid_encoding?).join
  end
end

puts "testing\xC2 a non UTF-8 string".validate_encoding
#=> testing a non UTF-8 string
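If you would rather not reopen String, the same character-filtering idea can be written as a standalone method. This is only a sketch: strip_invalid_utf8 is a hypothetical name, not part of Ruby, and it assumes the input is already tagged with the encoding you want to validate against.

```ruby
# Hypothetical helper (the name is made up for illustration): drops every
# character that is not valid in the string's own encoding.
def strip_invalid_utf8(text)
  # Nothing to do if the string is already well-formed.
  return text if text.valid_encoding?

  # each_char yields each invalid byte as its own one-byte "character",
  # so filtering on valid_encoding? removes exactly the bad bytes.
  kept = text.each_char.select(&:valid_encoding?).join

  # join on an empty array returns a US-ASCII string, so restore the
  # original encoding tag in case every character was dropped.
  kept.force_encoding(text.encoding)
end

puts strip_invalid_utf8("testing\xC2 a non UTF-8 string")
# prints: testing a non UTF-8 string
```

The early return keeps valid strings untouched, so the method can be called on all input without extra allocation in the common case.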

peter answered Oct 14 '22 23:10