Allowed characters in AppEngine Datastore key name

If I create a named key for use in Google AppEngine, what kind of String is the key-name? Does it use Unicode characters or is it a binary string?

More specifically, if I want to have my key-name made up of 8-bit binary data, is there a way I can do it? If not, can I at least use 7-bit binary data? Or are there any reserved values? Does it use NULL as the End-Of-String marker, for example?

Markus A. asked Sep 23 '14 21:09


2 Answers

GAE docs do not specify any restrictions on the key-name String. So a String with any content should be valid.

If you want to use binary data as an identifier, you should first encode it into a String. Any binary-to-text encoding will do; the most common ones seem to be Base64 (3 bytes = 4 chars) and hexadecimal, i.e. Base16 (1 byte = 2 chars).
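As a rough illustration of the encoding step, here is a minimal Python sketch using only the standard library. The helper names `encode_key_name` and `decode_key_name` are made up for this example and are not part of any AppEngine API; the URL-safe Base64 alphabet is my own choice, not something the docs require.

```python
import base64


def encode_key_name(raw: bytes, scheme: str = "base64") -> str:
    """Turn arbitrary bytes into a printable string usable as a key name.

    Hypothetical helper for illustration only.
    """
    if scheme == "base64":
        # URL-safe alphabet: uses '-' and '_' instead of '+' and '/'.
        return base64.urlsafe_b64encode(raw).decode("ascii")
    if scheme == "hex":
        # Two hex digits per input byte.
        return raw.hex()
    raise ValueError("unknown scheme: " + scheme)


def decode_key_name(name: str, scheme: str = "base64") -> bytes:
    """Inverse of encode_key_name."""
    if scheme == "base64":
        return base64.urlsafe_b64decode(name)
    if scheme == "hex":
        return bytes.fromhex(name)
    raise ValueError("unknown scheme: " + scheme)
```

The resulting string would then be passed wherever the API expects the key name, and decoded again after reading the key back.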

Peter Knego answered Sep 27 '22 15:09


In the meantime, I had some time to actually test this by generating a bunch of keys with binary names and then running a kind-only query to get all the keys back. Here are the results:

  • Any binary character is fine. If you create an entity with key name "\x00\x13\x127\x255", a query will find this entity and return that same string as its key name
  • The AppEngine Dashboard, Database Viewer, and other tools simply omit characters that aren't displayable, so the key names "\x00test" and "\x00\x00test" show up as two separate entities, but both keys are displayed as "test"
  • I have not tested all available AppEngine tools, just some of the basics in the Console, so there may be other tools that get confused by such keys...
  • Keys are UTF-8 encoded, so any character between 128 and 255 takes up 2 bytes of storage
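The storage effect of the UTF-8 encoding can be checked offline with a small Python sketch. It assumes that each byte of the binary key name is treated as the character with the same code point (which is what Latin-1 decoding does); the function name `utf8_size` is hypothetical, and this only models the encoding step, not what the Datastore actually bills for.

```python
def utf8_size(raw: bytes) -> int:
    """Number of bytes needed once each input byte is stored as a UTF-8 character.

    Assumption: byte value N becomes the character with code point N.
    """
    # Latin-1 maps byte N to code point N, then UTF-8 encodes that character.
    return len(raw.decode("latin-1").encode("utf-8"))


# Bytes 0..127 stay at one byte each, bytes 128..255 need two bytes each.
low = bytes(range(0, 128))
high = bytes(range(128, 256))
print(utf8_size(low), utf8_size(high))
```

For uniformly random 8-bit data, half of the bytes fall into the upper range, which is where an average of about 1.5 stored bytes per input byte comes from.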

From this, I would derive the following recommendations:

  • If you need to be able to work with individual entities from the AppEngine console and need to identify them by key, you are limited to printable characters and thus need to encode the binary key name into a String in Base16 (hex; 100% overhead), Base64 (33% overhead), or Base85 (25% overhead)
  • If you don't care about key readability, but need to pack as much data as possible into the key name with minimal storage use, use Base128 encoding (i.e. 7 bits per character; 14% overhead) to avoid the implicit UTF-8 encoding of 8-bit data (about 50% overhead on average for random data)
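Since "Base128" is not a standard encoding with a library implementation, here is a minimal Python sketch of what I mean by it: the input bits are repacked into groups of 7, and each group becomes one character in the range 0 to 127. The function names are made up for this example, and the output contains control characters (including NUL), so it relies on the observation above that such characters are accepted in key names.

```python
def base128_encode(raw: bytes) -> str:
    """Repack 8-bit bytes into 7-bit characters (code points 0..127)."""
    bits = 0    # bit accumulator
    nbits = 0   # number of valid bits currently in the accumulator
    out = []
    for b in raw:
        bits = (bits << 8) | b
        nbits += 8
        while nbits >= 7:
            nbits -= 7
            out.append(chr((bits >> nbits) & 0x7F))
        bits &= (1 << nbits) - 1  # keep only the bits not yet emitted
    if nbits:
        # Pad the final partial group with zero bits on the right.
        out.append(chr((bits << (7 - nbits)) & 0x7F))
    return "".join(out)


def base128_decode(text: str) -> bytes:
    """Inverse of base128_encode; trailing padding bits are dropped."""
    bits = 0
    nbits = 0
    out = bytearray()
    for ch in text:
        value = ord(ch)
        if value > 0x7F:
            raise ValueError("not a 7-bit character: %r" % ch)
        bits = (bits << 7) | value
        nbits += 7
        if nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
            bits &= (1 << nbits) - 1
    return bytes(out)
```

Every output character is below 128, so its UTF-8 form is a single byte, and 7 input bytes turn into 8 stored bytes.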

Asides:

I will accept @PeterKnego's answer instead of this one since this one basically only confirms and expands on what he already assumed correctly.

From looking through the source code of the Java API, I think that the UTF-8 encoding of the key-name happens in the API (while building the protocol buffer) rather than in BigTable, so if you really want to go nuts on storage space maximization, it may be possible to build your own protocol buffers and store full 8-bit data without overhead. But this is probably asking for trouble...

Markus A. answered Sep 27 '22 15:09