
How to check if a string contains ASCII code

Tags: string, ruby, utf-8

Given a string "A\xC3B", it can be converted to a UTF-8 string like this:

"A\xC3B".force_encoding('iso-8859-1').encode('utf-8') #=> "AÃB"

However, I only want to perform the action if the string contains the ASCII code, namely \xC3. How can I check for that?

I tried "A\xC3B".include?("\x"), but it doesn't work.

asked Jun 22 '15 21:06 by sbs



2 Answers

\x is just a hexadecimal escape sequence. It has nothing to do with encodings on its own. US-ASCII goes from "\x00" to "\x7F" (e.g. "\x41" is the same as "A", "\x30" is "0"). The rest ("\x80" to "\xFF"), however, are not US-ASCII characters, since US-ASCII is a 7-bit character set.
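To make the byte ranges concrete, here is a minimal sketch that looks at the raw bytes directly. The helper name high_byte? is hypothetical and only for illustration; String#bytes is standard Ruby.

```ruby
# "\x41" is just another way to write the byte 0x41, which is "A" in US-ASCII.
same = ("\x41" == "A")

# String#bytes exposes the raw byte values regardless of the encoding label.
bytes = "A\xC3B".bytes

# Hypothetical helper: true if any byte lies outside the 7-bit US-ASCII range.
def high_byte?(str)
  str.bytes.any? { |b| b > 0x7F }
end

p same                  # => true
p bytes                 # => [65, 195, 66]
p high_byte?("A\xC3B")  # => true
p high_byte?("\x41BC")  # => false
```

This is also why include?("\x") cannot work: the escape is resolved when the literal is parsed, so the string never contains a backslash or an "x" to search for.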

If you want to check if a string contains only US-ASCII characters, call String#ascii_only?:

p "A\xC3B".ascii_only? # => false
p "\x41BC".ascii_only? # => true

Another example based on your code:

str = "A\xC3B"
unless str.ascii_only?
  str.force_encoding(Encoding::ISO_8859_1).encode!(Encoding::UTF_8)
end
p str.encoding # => #<Encoding:UTF-8>
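
If mutating the original string is not desirable, the same check could be wrapped in a small helper that works on a copy. This is only a sketch: the name latin1_to_utf8 is hypothetical, and it assumes that any non-ASCII input really is ISO-8859-1.

```ruby
# Hypothetical helper: ASCII-only strings are returned unchanged; anything else
# is assumed to hold ISO-8859-1 bytes and is transcoded into a new UTF-8 string.
def latin1_to_utf8(str)
  return str if str.ascii_only?
  str.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

original  = "A\xC3B"
converted = latin1_to_utf8(original)

p converted == "AÃB"       # => true
p converted.encoding       # => #<Encoding:UTF-8>
p original.bytes           # => [65, 195, 66] (the input is left untouched)
p latin1_to_utf8("ABC")    # => "ABC"
```

Note that a string which is already valid UTF-8 and contains non-ASCII characters would be converted a second time by this helper, so it only fits input known to be ISO-8859-1.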
answered Sep 25 '22 10:09 by cremno


I think what you want to do is to figure out whether your string is properly encoded. The ascii_only? solution isn't much help when dealing with non-ASCII strings.

I would use String#valid_encoding? to verify whether a string is properly encoded, even if it contains non-ASCII chars.

For example, what if someone has already encoded "Françoise Paré" correctly as UTF-8? The string is not ASCII-only, but it needs no conversion. Compare that with "Fran\xE7oise Par\xE9", which is what you get when the name was encoded as ISO-8859-1 and the bytes are then read as UTF-8.

[62] pry(main)> "Françoise Paré".encode("utf-8").valid_encoding?
=> true

[63] pry(main)> "Françoise Paré".encode("iso-8859-1")
=> "Fran\xE7oise Par\xE9"

# Note that the encoding is still valid; this is just how the console
# displays ISO-8859-1 strings.

[64] pry(main)> "Françoise Paré".encode("iso-8859-1").valid_encoding?
=> true

# Now let's interpret our 8859 string as UTF-8. In the following
# line, the string bytes don't change, `force_encoding` just makes
# Ruby interpret those same bytes as UTF-8.

[65] pry(main)> "Françoise Paré".encode("iso-8859-1").force_encoding("utf-8")
=> "Fran\xE7oise Par\xE9"

# Is a lone \xE7 valid UTF-8? Nope.

[66] pry(main)> "Françoise Paré".encode("iso-8859-1").force_encoding("utf-8").valid_encoding?
=> false
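
Putting the session above together, one possible sketch is to convert only when the bytes are not valid UTF-8. The helper name ensure_utf8 is hypothetical, and the fallback to ISO-8859-1 is an assumption about where the data came from, not something Ruby can detect.

```ruby
# Hypothetical helper: strings that are valid UTF-8 are returned as a UTF-8 copy;
# anything else is assumed to hold ISO-8859-1 bytes and is transcoded.
def ensure_utf8(str)
  candidate = str.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?
  candidate.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

latin1_bytes = "Fran\xE7oise Par\xE9"  # ISO-8859-1 bytes labeled as UTF-8

p latin1_bytes.valid_encoding?                       # => false
p ensure_utf8(latin1_bytes) == "Françoise Paré"      # => true
p ensure_utf8("Françoise Paré") == "Françoise Paré"  # => true
```

This remains a heuristic: some ISO-8859-1 byte sequences also happen to be valid UTF-8 (for example the bytes C3 A9), and those would pass through unconverted.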
answered Sep 23 '22 10:09 by Jonathan Allard