The official way to convert between string encodings as of Ruby 1.9 is to use String#encode.
To simply remove non-ASCII characters, you could do this:
some_ascii = "abc"
some_unicode = "áëëçüñżλφθΩ𠜎😸"
more_ascii = "123ABC"
invalid_byte = "\255"
non_ascii_string = [some_ascii, some_unicode, more_ascii, invalid_byte].join
# See String#encode documentation
encoding_options = {
  :invalid           => :replace, # Replace invalid byte sequences
  :undef             => :replace, # Replace anything not defined in ASCII
  :replace           => '',       # Use a blank for those replacements
  :universal_newline => true      # Always break lines with \n
}
# Splat the hash so the options are passed as keyword arguments
ascii = non_ascii_string.encode(Encoding.find('ASCII'), **encoding_options)
puts ascii.inspect
# => "abce123ABC"
Notice that the first 5 characters in the result are "abce1" - the "á" was discarded, one "ë" was discarded, but another "ë" appears to have been converted to "e".
The reason for this is that there are sometimes multiple ways to express the same written character in Unicode. The "á" is a single Unicode codepoint. The first "ë" is, too. When Ruby sees these during this conversion, it discards them.
But the second "ë" is two codepoints: a plain "e", just like you'd find in an ASCII string, followed by a "combining diacritical mark" (U+0308, the combining diaeresis), which means "put an umlaut on the previous character". In the Unicode string, these are interpreted as a single "grapheme", or visible character. When converting this, Ruby keeps the plain ASCII "e" and discards the combining mark.
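To make the difference concrete, here is a minimal sketch that builds both forms of "ë" from explicit escapes and converts each one. The helper name to_ascii is hypothetical and exists only for this illustration; unicode_normalize assumes Ruby 2.2 or later.

```ruby
# Hypothetical helper for this illustration only.
def to_ascii(str)
  str.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "")
end

composed   = "\u00EB"   # "ë" as a single precomposed codepoint
decomposed = "e\u0308"  # "e" followed by U+0308 COMBINING DIAERESIS

p composed.codepoints.length    # 1
p decomposed.codepoints.length  # 2
p to_ascii(composed)            # ""
p to_ascii(decomposed)          # "e"

# Decomposing first (NFD, Ruby 2.2+) keeps the base letter either way.
p to_ascii(composed.unicode_normalize(:nfd))  # "e"
```

If you want accented letters to degrade to their base letter consistently, normalizing to NFD before encoding is one option; whether that is desirable depends on your data.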
If you decide you'd like to provide some specific replacement values, you could do this:
REPLACEMENTS = {
  'á' => 'a',
  'ë' => 'e',
}
encoding_options = {
  :invalid           => :replace, # Replace invalid byte sequences
  :replace           => "",       # Use a blank for those replacements
  :universal_newline => true,     # Always break lines with \n
  # For any character that isn't defined in ASCII, run this
  # code to find out how to replace it
  :fallback => lambda { |char|
    # If no replacement is specified, use an empty string
    REPLACEMENTS.fetch(char, "")
  },
}
# Splat the hash so the options are passed as keyword arguments
ascii = non_ascii_string.encode(Encoding.find('ASCII'), **encoding_options)
puts ascii.inspect
#=> "abcaee123ABC"
Some have reported issues with the :universal_newline option. I have seen this intermittently, but haven't been able to track down the cause. When it happens, I see Encoding::ConverterNotFoundError: code converter not found (universal_newline). However, after some RVM updates, I've just run the script above under several Ruby versions without problems.
Given this, it doesn't appear to be a deprecated feature or even a bug in Ruby. If anyone knows the cause, please comment.
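If the option does misbehave in your environment, one possible workaround (a sketch, not something taken from the Ruby documentation) is to leave :universal_newline out and normalize line endings yourself. The method name below is hypothetical.

```ruby
# Hypothetical helper: same conversion as above, without :universal_newline.
def to_ascii_with_unix_newlines(str)
  ascii = str.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "")
  # Convert CRLF pairs and lone CRs to LF by hand.
  ascii.gsub(/\r\n?/, "\n")
end

p to_ascii_with_unix_newlines("caf\u00E9\r\nbar\rbaz\n")  # "caf\nbar\nbaz\n"
```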
class String
  def remove_non_ascii(replacement = "")
    # Match every character outside the 7-bit ASCII range (0x00 to 0x7F)
    gsub(/[^\x00-\x7F]/, replacement)
  end
end
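One caveat with a regex-based approach: matching a regex against a string that contains invalid byte sequences can raise ArgumentError. A minimal sketch that guards against this, assuming Ruby 2.1+ for String#scrub; the free function strip_non_ascii is a hypothetical name, not part of the answer above.

```ruby
# Hypothetical helper, shown as a plain method instead of a String patch.
def strip_non_ascii(str, replacement = "")
  # scrub("") drops invalid byte sequences first, so gsub can run safely.
  str.scrub("").gsub(/[^\x00-\x7F]/, replacement)
end

p strip_non_ascii("abc\u00E1\u03BB\u{1F638}123")  # "abc123"
p strip_non_ascii("abc\xADdef")                    # "abcdef"
p strip_non_ascii("a\u00E1b", "?")                 # "a?b"
```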
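Here's my suggestion using Iconv. Note that Iconv was deprecated in Ruby 1.9.3 and removed from the standard library in Ruby 2.0, so on newer Rubies it requires the iconv gem.
require 'iconv'

class String
  def remove_non_ascii
    # //IGNORE tells iconv to drop characters it cannot convert
    Iconv.conv('ASCII//IGNORE', 'UTF-8', self)
  end
end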
If you have Active Support, you can use I18n.transliterate:
I18n.transliterate("áëëçüñżλφθΩ𠜎")
"aee?cunz?????"
Or, if you don't want the question marks:
I18n.transliterate("áëëçüñżλφθΩ𠜎", replacement: "")
"aeecunz"
Note that this doesn't remove invalid byte sequences; it just replaces non-ASCII characters. For my use case, though, this was what I wanted, and it was simple.
With a bit of help from @masakielastic, I have solved this problem for my own purposes using the String#chars method.
The trick is to break the string down into individual characters so that Ruby can fail on each one separately.
Ruby needs to fail when it confronts binary data and the like. If you don't allow Ruby to go ahead and fail, it's a tough road when it comes to this stuff. So I use the String#chars method to break the given string into an array of characters. Then I pass each character to a sanitizing method that allows the code to have "microfailures" (my coinage) within the string.
So, given a "dirty" string, let's say you used File#read on a picture (my case):
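# Whitelist check. Matching a regex against an invalid byte sequence
# raises ArgumentError, which the caller rescues.
def num_or_letter?(char)
  if char =~ /[a-zA-Z0-9]/
    true
  elsif char =~ Regexp.union(" ", ".", "?", "-", "+", "/", ",", "(", ")")
    true
  end
end

# File.read closes the file handle once the contents have been read
dirty = File.read(filepath)
clean_chars = dirty.chars.select do |c|
  begin
    num_or_letter?(c)
  rescue ArgumentError
    next # a "microfailure": skip this character
  end
end
clean = clean_chars.join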
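The same per-character idea can be sketched without rescuing exceptions by asking each character whether it is validly encoded before matching it. This is an alternative I'm assuming is equivalent for this whitelist, not the code from the answer above; ALLOWED and keep_allowed_chars are hypothetical names.

```ruby
# Hypothetical whitelist mirroring num_or_letter? above.
ALLOWED = /[a-zA-Z0-9 .?\-+\/,()]/

def keep_allowed_chars(dirty)
  # String#chars yields each invalid byte as its own one-byte "character",
  # so valid_encoding? can filter those out before the regex is applied.
  dirty.chars.select { |c| c.valid_encoding? && c =~ ALLOWED }.join
end

# Two invalid bytes and one non-ASCII letter mixed into otherwise clean text.
sample = "ok\xFF\xFE (1+2)" + "\u00E9?"
p keep_allowed_chars(sample)  # "ok (1+2)?"
```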
class String
  def strip_control_characters
    chars.reject { |char| char.ascii_only? && (char.ord < 32 || char.ord == 127) }.join
  end
end
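A quick usage sketch of the same logic, written as a standalone method (strip_control_chars is a hypothetical name) so it can be tried without reopening String:

```ruby
# Hypothetical free-function version of the method above.
def strip_control_chars(str)
  str.chars.reject { |c| c.ascii_only? && (c.ord < 32 || c.ord == 127) }.join
end

# Tab, NUL, DEL and the trailing newline are removed; "é" is kept.
cleaned = strip_control_chars("a\tb\u0000c\u007Fd\u00E9\n")
p cleaned == "abcd\u00E9"  # true
```

Note that this also strips tabs and newlines, since both are below code point 32, and that it leaves non-ASCII characters untouched.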