
WebSockets and text encoding

I read:

The WebSocket API accepts a DOMString object, which is encoded as UTF-8 on the wire, or one of ArrayBuffer, ArrayBufferView, or Blob objects for binary transfers.

A DOMString is a UTF-16 encoded string. So is it correct that UTF-8 encoding is used over the wire?

Ben Aston asked Apr 20 '17 20:04



1 Answer

Yes, it is correct.

UTF-16 may or may not be used in memory; that is an implementation detail of whatever framework you are using. In the case of JavaScript, strings are sequences of UTF-16 code units.
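To make the in-memory versus encoded distinction concrete, here is a minimal sketch that compares the number of UTF-16 code units in a JavaScript string with the number of bytes the same text takes as UTF-8. The `measure` helper and the sample strings are only illustrative; `TextEncoder` is the standard API and always produces UTF-8.

```javascript
// Sample text: "A", "é", "€", and an emoji, written as escapes to avoid
// any ambiguity about how this file itself is encoded.
const samples = ["A", "\u00e9", "\u20ac", "\u{1F600}"];

// Hypothetical helper: report both sizes for one string.
function measure(text) {
  return {
    text,
    // String.prototype.length counts UTF-16 code units, not characters.
    utf16CodeUnits: text.length,
    // TextEncoder.encode() returns a Uint8Array of UTF-8 bytes.
    utf8Bytes: new TextEncoder().encode(text).length,
  };
}

const results = samples.map(measure);
for (const r of results) {
  console.log(r.text, r.utf16CodeUnits, r.utf8Bytes);
}
```

The emoji is the interesting case: it occupies two UTF-16 code units (a surrogate pair) in memory but four bytes as UTF-8.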

For WebSocket communications, UTF-8 must be used over the wire for textual data (most Internet protocols use UTF-8 nowadays). That is dictated by the WebSocket protocol specification:

After a successful handshake, clients and servers transfer data back and forth in conceptual units referred to in this specification as "messages". On the wire, a message is composed of one or more frames. The WebSocket message does not necessarily correspond to a particular network layer framing, as a fragmented message may be coalesced or split by an intermediary.

A frame has an associated type. Each frame belonging to the same message contains the same type of data. Broadly speaking, there are types for textual data (which is interpreted as UTF-8 [RFC3629] text), binary data (whose interpretation is left up to the application), and control frames (which are not intended to carry data for the application but instead for protocol-level signaling, such as to signal that the connection should be closed). This version of the protocol defines six frame types and leaves ten reserved for future use.

...

Data frames (e.g., non-control frames) are identified by opcodes where the most significant bit of the opcode is 0. Currently defined opcodes for data frames include 0x1 (Text), 0x2 (Binary). Opcodes 0x3-0x7 are reserved for further non-control frames yet to be defined.

Data frames carry application-layer and/or extension-layer data. The opcode determines the interpretation of the data:

Text

The "Payload data" is text data encoded as UTF-8. Note that a particular text frame might include a partial UTF-8 sequence; however, the whole message MUST contain valid UTF-8. Invalid UTF-8 in reassembled messages is handled as described in Section 8.1.

Binary

The "Payload data" is arbitrary binary data whose interpretation is solely up to the application layer.
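The note about partial UTF-8 sequences matters if you ever handle frames yourself. As a rough sketch, assuming you already have the unmasked payloads of one text message in order, a streaming `TextDecoder` can reassemble them. `decodeTextMessage` is a hypothetical helper, not part of any WebSocket library; browsers do this for you before `onmessage` fires.

```javascript
// Hypothetical helper: decode the payloads of one fragmented text message.
// `fragments` is an array of Uint8Array payloads, already unmasked, in order.
function decodeTextMessage(fragments) {
  // fatal: true makes invalid UTF-8 throw a TypeError instead of silently
  // substituting U+FFFD, which matches the "MUST contain valid UTF-8" rule.
  const decoder = new TextDecoder("utf-8", { fatal: true });
  let text = "";
  for (const payload of fragments) {
    // stream: true keeps an incomplete trailing sequence buffered so the
    // next fragment can complete it.
    text += decoder.decode(payload, { stream: true });
  }
  // A final call without stream: true flushes the decoder. Per the Encoding
  // Standard, this should throw if the message ended mid-sequence.
  text += decoder.decode();
  return text;
}

// "€" is E2 82 AC in UTF-8; here it is split across two frames.
const message = decodeTextMessage([
  Uint8Array.of(0x68, 0x69, 0x20, 0xe2),
  Uint8Array.of(0x82, 0xac),
]);
console.log(message); // hi €
```

Decoding each frame on its own would fail here, because neither frame holds a complete encoding of "€"; only the reassembled message is valid UTF-8.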

You will incur a small amount of overhead converting from UTF-16 to UTF-8 and back to UTF-16, but the overhead is minimal on modern machines, and conversions between UTFs are lossless for well-formed strings.
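If you want to convince yourself of the round trip, a small sketch with `TextEncoder` and `TextDecoder` shows it. The `roundTrip` helper is just for illustration. The one caveat is a JavaScript string containing an unpaired surrogate: that is not well-formed UTF-16, so it has no UTF-8 encoding, and the encoder substitutes U+FFFD. As far as I know, `WebSocket.send()` applies the same replacement, since it takes its string argument as a USVString.

```javascript
const encoder = new TextEncoder();          // always encodes to UTF-8
const decoder = new TextDecoder("utf-8");

// Hypothetical helper: UTF-16 string -> UTF-8 bytes -> UTF-16 string.
function roundTrip(text) {
  return decoder.decode(encoder.encode(text));
}

// Well-formed text survives unchanged, including characters outside the BMP.
const wellFormed = "na\u00efve \u20ac \u{1F600}";
console.log(roundTrip(wellFormed) === wellFormed); // true

// A lone high surrogate is not a Unicode scalar value, so it cannot be
// represented in UTF-8; the encoder replaces it with U+FFFD.
const loneSurrogate = "\uD83D";
console.log(roundTrip(loneSurrogate) === "\uFFFD"); // true
```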

Remy Lebeau answered Sep 21 '22 23:09
