
In Ruby, how do you read UTF-8 from a socket?

When a server sends UTF-8 bytes, how do you read them so that the characters do not show up as raw byte escapes (\x40 etc.)?

Asked by lcarpenter, Jun 27 '12 12:06



2 Answers

I believe read_nonblock uses read, whose documentation says:

The resulted string is always ASCII-8BIT encoding.

This means you do not need to call IO#set_encoding. Instead, after you have read the whole string, you can set its encoding to UTF-8 with String#force_encoding (the method has no bang form; it relabels the string in place without changing its bytes).
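A minimal sketch of that approach, assuming a Unix-like system where UNIXSocket.pair is available (it stands in for a real server connection here, and the sample message is made up for illustration):

```ruby
require 'socket'

# Two connected sockets inside one process (assumption: Unix-like platform).
writer, reader = UNIXSocket.pair

message = "\u1681 \u1682 \u1683\n"   # "ᚁ ᚂ ᚃ\n", sent as UTF-8 bytes
writer.write(message)
writer.close

# read(length) hands back the bytes labelled as ASCII-8BIT (binary).
raw = reader.read(message.bytesize)
reader.close
raw_encoding = raw.encoding

# force_encoding changes only the label on the same string object.
text = raw.force_encoding(Encoding::UTF_8)
puts text if text.valid_encoding?
```

Checking valid_encoding? after relabelling is a cheap way to notice that the bytes were not UTF-8 after all, or that the read stopped partway through a character.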

I emphasized 'whole' because the string must end on a complete character: if a read stops partway through a multi-byte character, the string will end in an invalid UTF-8 sequence, and Ruby may complain about it further down the line.
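One way to handle a chunk that ends mid-character is to hold back the unfinished trailing bytes and prepend them to the next chunk. The helper below is a hypothetical sketch (split_complete_utf8 is not a Ruby or socket API). It only deals with truncation at the end of the buffer and does not validate the rest of the data:

```ruby
# Hypothetical helper: splits a binary buffer into
# [complete UTF-8 text, leftover bytes of an unfinished trailing character].
def split_complete_utf8(buffer)
  bytes = buffer.b                       # binary copy; caller's string is untouched
  cut = bytes.bytesize
  # A UTF-8 character is at most 4 bytes long, so the lead byte of an
  # unfinished trailing character lies within the last 3 bytes.
  1.upto([3, bytes.bytesize].min) do |back|
    byte = bytes.getbyte(bytes.bytesize - back)
    next if (byte & 0xC0) == 0x80        # continuation byte: keep walking back
    if byte >= 0xC0                      # lead byte of a multi-byte character
      needed = byte >= 0xF0 ? 4 : (byte >= 0xE0 ? 3 : 2)
      cut = bytes.bytesize - back if back < needed
    end
    break                                # ASCII or lead byte found: stop looking
  end
  text = bytes.byteslice(0, cut).force_encoding(Encoding::UTF_8)
  rest = bytes.byteslice(cut, bytes.bytesize - cut)
  [text, rest]
end

# Simulate a read that stops in the middle of the second character.
data = "\u1681\u1682".b                  # "ᚁᚂ", three bytes each in UTF-8
text, pending = split_complete_utf8(data.byteslice(0, 4))
# Prepend the held-back byte to the next chunk before splitting again.
more, pending = split_complete_utf8(pending + data.byteslice(4, 2))
```

In a read loop, the bytes returned by each read_nonblock call would be appended to the pending bytes from the previous iteration before calling the helper.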

Answered by Mladen Jablanović, Oct 12 '22 20:10


You can use IO#set_encoding to set a socket's external encoding to UTF-8.

#!/usr/bin/env ruby
# -*- coding: utf-8 -*-

require 'socket'

# Port 0 asks the OS for any free port.
server_socket = TCPServer.new('localhost', 0)
Thread.new do
  loop do
    session_socket = server_socket.accept
    # The server writes the string's bytes unchanged.
    session_socket.set_encoding 'ASCII-8BIT'
    session_socket.puts "  ᚁ ᚂ ᚃ ᚄ ᚅ ᚆ ᚇ ᚈ ᚉ ᚊ ᚋ ᚌ ᚍ"
    session_socket.close
  end
end

client_socket = TCPSocket.new('localhost', server_socket.addr[1])
# Label everything read from this socket as UTF-8.
client_socket.set_encoding 'UTF-8'
p client_socket.gets
# => "  ᚁ ᚂ ᚃ ᚄ ᚅ ᚆ ᚇ ᚈ ᚉ ᚊ ᚋ ᚌ ᚍ\n"
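
The external encoding applies to line-oriented reads such as gets, while a fixed-length read(length) still returns a binary string. The following sketch illustrates the difference, assuming a Unix-like system where UNIXSocket.pair is available in place of a real connection:

```ruby
require 'socket'

# Two connected sockets inside one process (assumption: Unix-like platform).
writer, reader = UNIXSocket.pair
reader.set_encoding 'UTF-8'

writer.write("\u1681\n\u1682\n")   # "ᚁ\nᚂ\n", four bytes per line
writer.close

line  = reader.gets      # honours the external encoding set above
chunk = reader.read(4)   # fixed-length read: labelled ASCII-8BIT regardless
reader.close

puts line.encoding
puts chunk.encoding
```

If the client must use fixed-length or non-blocking reads, the chunks therefore still need force_encoding after they are assembled.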
Answered by Wayne Conrad, Oct 12 '22 20:10