
Charset in data URI

Over the years, from reading the evolving specs, I had assumed that RFC 3986 had finally settled on UTF-8 encoding for escaped octet sequences. That is, if my URI has %XX%YY%ZZ in its scheme-specific part (for any URI scheme), I can take the sequence of decoded octets and interpret the resulting bytes as UTF-8 to find out what decoded information was intended. In practical terms, I can call JavaScript's decodeURIComponent(), which does this decoding automatically for me.

Then I read the spec for data: URIs, RFC 2397, which includes a charset argument, which (naturally) indicates the charset of the encoded data. But how does that work? If I have a two-octet encoded sequence %XX%YY in my data: URI, does a charset=iso-8859-1 indicate that the two decoded octets should not be interpreted as a UTF-8 sequence, but as two separate Latin characters (as each byte in ISO-8859-1 represents a character)? RFC 2397 seems to indicate this, as it gives an example of "greek [sic] characters":

data:text/plain;charset=iso-8859-7,%be%fg%be

But this means that JavaScript decodeURIComponent() (which assumes UTF-8 encoded octets) can't be used to extract a string from a data URI, correct? Does this mean I have to create my own decoding for data URIs if the charset is something besides UTF-8?

Furthermore, does this mean that RFC 2397 is now in conflict with RFC 3986, which seems to indicate that UTF-8 is assumed? Or does RFC 3986 only refer to "new URI scheme[s]", meaning that the data: URI scheme gets grandfathered in and has its own technique for specifying what the encoded octets mean?

My best guess at the moment is that data: plays by its own rules and if it indicates a charset other than UTF-8, I'll have to use something other than decodeURIComponent() in JavaScript. Any recommendations on a replacement method would be welcome, too.

Garret Wilson asked May 25 '13 18:05



1 Answer

Remember that the data: URI scheme describes a resource that can be thought of as a file consisting of an opaque bytestream, just as though it were an http: URI (the same bytestream, but stored on an HTTP server), an ftp: URI (the same bytestream, but stored on an FTP server), or a file: URI (the same bytestream, but stored on your local filesystem). Only the metadata attached to the file gives the bytestream meaning.

RFC 2397 gives a clear specification of how this bytestream is to be embedded in the URI itself (in contrast to other URI schemes, where the URI gives instructions on where to fetch the bytestream, not what it contains). It might be base64 or it might be the percent-encoding method given in the RFC. Base64 is going to be more compact if the bytestream contains many non-ASCII bytes.
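To make the two embedding methods concrete, here is a minimal sketch of recovering the raw bytes from a data: URI. The name dataUriToBytes is hypothetical (it is not a built-in), the parsing is deliberately simplistic, and it assumes an environment with a global atob (browsers, recent Node.js):

```javascript
// Hypothetical helper: split a data: URI into its media type and raw bytes.
// A sketch only: it assumes a single comma separates header and payload and
// that any unescaped payload characters are plain ASCII.
function dataUriToBytes(uri) {
  const match = /^data:([^,]*),(.*)$/s.exec(uri);
  if (!match) throw new Error("not a data: URI");
  const header = match[1];
  const payload = match[2];
  const isBase64 = /;base64$/i.test(header);
  // RFC 2397 default when the media type is omitted.
  const mediaType =
    header.replace(/;base64$/i, "") || "text/plain;charset=US-ASCII";

  if (isBase64) {
    // atob yields a "binary string": one char code (0-255) per byte.
    const binary = atob(payload);
    return { mediaType, bytes: Uint8Array.from(binary, (c) => c.charCodeAt(0)) };
  }

  // Percent-encoding: %XX is one byte; any other character stands for itself.
  const bytes = [];
  for (let i = 0; i < payload.length; i++) {
    const hex = payload.slice(i + 1, i + 3);
    if (payload[i] === "%" && /^[0-9a-fA-F]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16));
      i += 2; // skip the two hex digits just consumed
    } else {
      bytes.push(payload.charCodeAt(i) & 0xff);
    }
  }
  return { mediaType, bytes: Uint8Array.from(bytes) };
}

// Both forms carry the same three bytes 0xE1 0xE2 0xE3:
const a = dataUriToBytes("data:text/plain;charset=iso-8859-7,%e1%e2%e3");
const b = dataUriToBytes("data:text/plain;charset=iso-8859-7;base64,4eLj");
```

Note that nothing here looks at the charset: at this stage the payload is just bytes, which is the point of treating it as an opaque bytestream.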

The data: URI also describes its own Content-Type, which gives the intended interpretation of the bytestream. In this case, since you have used text/plain;charset=iso-8859-7, the bytes must be correctly encoded ISO-8859-7 text. The bytes will definitely not be decoded as UTF-8 or any other character encoding. They will be unambiguously decoded using the character encoding you have specified.
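As for a replacement for decodeURIComponent(), one possible approach (a sketch, not the only way) is to hand the raw bytes to the standard TextDecoder API with the charset taken from the media type. The helper name decodeText is hypothetical, and this assumes a runtime whose TextDecoder supports the legacy encoding in question (browsers do; Node.js needs a build with full ICU):

```javascript
// Hypothetical helper: interpret raw bytes using the charset named in the
// media type, falling back to the RFC 2397 default of US-ASCII.
function decodeText(bytes, mediaType) {
  const match = /;\s*charset=("?)([^;"]+)\1/i.exec(mediaType);
  const charset = match ? match[2] : "US-ASCII";
  // fatal: true makes TextDecoder throw on bytes that are invalid in the
  // charset. Caveat: TextDecoder treats the "us-ascii" label as windows-1252.
  return new TextDecoder(charset, { fatal: true }).decode(bytes);
}

// The three bytes 0xE1 0xE2 0xE3 are Greek letters in ISO-8859-7...
const bytes = Uint8Array.of(0xe1, 0xe2, 0xe3);
const greek = decodeText(bytes, "text/plain;charset=iso-8859-7"); // "αβγ"

// ...but they are not a valid UTF-8 sequence, so the UTF-8 route fails.
let utf8Failed = false;
try {
  decodeURIComponent("%e1%e2%e3");
} catch (e) {
  utf8Failed = e instanceof URIError; // true
}
```

The design choice is to keep the two steps separate: first percent-decode (or base64-decode) the payload to bytes, then apply the declared character encoding, rather than relying on a function that fuses both steps and hard-codes UTF-8.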

Celada answered Sep 17 '22 01:09